Abstract

In this paper we analyze boosting algorithms [15, 21, 24] in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FS$_\varepsilon$) and least squares boosting (LS-Boost($\varepsilon$)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a modification of FS$_\varepsilon$ that yields an algorithm for the Lasso, and that may be easily extended to an algorithm that computes the Lasso path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the Lasso may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-Boost($\varepsilon$) and FS$_\varepsilon$) by using techniques of modern first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.


1 Introduction 


Boosting [19, 24, 28, 38, 39] is an extremely successful and popular supervised learning method that combines multiple weak$^1$ learners into a powerful “committee.” AdaBoost [20, 28, 39] is one of the earliest boosting algorithms developed in the context of classification. [5, 6] observed that AdaBoost may be viewed as an optimization algorithm, particularly as a form of gradient descent in a certain function space. In an influential paper, [24] interpreted boosting methods used in classification problems, and in particular AdaBoost, as instances of stagewise additive modeling [29], a fundamental modeling tool in statistics. This connection yielded crucial insight about the statistical model underlying boosting and provided a simple statistical explanation behind the success of boosting methods. [21] provided a unified view of stagewise additive modeling and steepest descent minimization methods in function space to explain boosting methods. This viewpoint was adapted to various loss functions via a greedy function approximation scheme. For related perspectives from the machine learning community, the interested reader is referred to the works [32, 36] and the references therein.

$^1$ This term originates in the context of boosting for classification, where a “weak” classifier is slightly better than random guessing.


Boosting and Implicit Regularization An important instantiation of boosting, and the topic of the present paper, is its application in linear regression. We use the usual notation with model matrix $X = [X_1, \ldots, X_p] \in \mathbb{R}^{n \times p}$, response vector $y \in \mathbb{R}^{n \times 1}$, and regression coefficients $\beta \in \mathbb{R}^p$. We assume herein that the features $X_i$ have been centered to have zero mean and unit $\ell_2$ norm, i.e., $\|X_i\|_2 = 1$ for $i = 1, \ldots, p$, and $y$ is also centered to have zero mean. For a regression coefficient vector $\beta$, the predicted value of the response is given by $X\beta$ and $r = y - X\beta$ denotes the residuals.


Least Squares Boosting - LS-Boost($\varepsilon$) Boosting, when applied in the context of linear regression, leads to models with attractive statistical properties [7, 8, 21, 28]. We begin our study by describing one of the most popular boosting algorithms for linear regression: LS-Boost($\varepsilon$), proposed in [21]:


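Algorithm: Least Squares Boosting - LS-Boost($\varepsilon$)
Fix the learning rate $\varepsilon > 0$ and the number of iterations $M$.

Initialize at $\hat{r}^0 = y$, $\hat\beta^0 = 0$, $k = 0$.

1. For $0 \le k \le M$ do the following:

2. Find the covariate $j_k$ and $\tilde{u}_{j_k}$ as follows:

$$\tilde{u}_m = \arg\min_{u \in \mathbb{R}} \left( \sum_{i=1}^n (\hat{r}^k_i - x_{im} u)^2 \right) \ \text{for } m = 1, \ldots, p, \qquad j_k \in \arg\min_{1 \le m \le p} \sum_{i=1}^n (\hat{r}^k_i - x_{im} \tilde{u}_m)^2 .$$

3. Update the current residuals and regression coefficients as:

$$\hat{r}^{k+1} \leftarrow \hat{r}^k - \varepsilon X_{j_k} \tilde{u}_{j_k}$$

$$\hat\beta^{k+1}_{j_k} \leftarrow \hat\beta^k_{j_k} + \varepsilon \tilde{u}_{j_k} \quad \text{and} \quad \hat\beta^{k+1}_j \leftarrow \hat\beta^k_j , \ j \ne j_k .$$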


A special instance of the LS-Boost($\varepsilon$) algorithm with $\varepsilon = 1$ is known as LS-Boost [21] or Forward Stagewise [28]; it is essentially a method of repeated simple least squares fitting of the residuals [8]. The LS-Boost algorithm starts from the null model with residuals $\hat{r}^0 = y$. At the $k$-th iteration, the algorithm finds a covariate $j_k$ which results in the maximal decrease in the univariate regression fit to the current residuals. Let $X_{j_k} \tilde{u}_{j_k}$ denote the best univariate fit for the current residuals, corresponding to the covariate $j_k$. The residuals are then updated as $\hat{r}^{k+1} \leftarrow \hat{r}^k - X_{j_k} \tilde{u}_{j_k}$ and the $j_k$-th regression coefficient is updated as $\hat\beta^{k+1}_{j_k} \leftarrow \hat\beta^k_{j_k} + \tilde{u}_{j_k}$, with all other regression coefficients unchanged. We refer the reader to Figure 1, which depicts the evolution of the algorithmic properties of the LS-Boost($\varepsilon$) algorithm as a function of $k$ and $\varepsilon$. LS-Boost($\varepsilon$) has old roots: as noted by [8], LS-Boost with $M = 2$ is known as “twicing,” a method proposed by Tukey [42].
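The steps of LS-Boost($\varepsilon$) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `ls_boost` and `standardize` and the synthetic data are assumptions made here. It relies on the standardization assumed in the text (unit $\ell_2$-norm columns), under which the univariate least squares coefficient for column $m$ is simply $X_m^T \hat{r}^k$ and the best univariate fit is the column with the largest absolute value of that coefficient.

```python
import numpy as np

def standardize(X, y):
    """Center columns of X and y; scale columns of X to unit l2 norm."""
    Xc = X - X.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc, axis=0)
    return Xc, y - y.mean()

def ls_boost(X, y, eps=0.1, num_iters=100):
    """LS-Boost(eps) sketch; assumes the columns of X have unit l2 norm."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()           # residuals of the null model
    for _ in range(num_iters):
        u = X.T @ r                      # univariate LS coefficients u_m
        j = int(np.argmax(np.abs(u)))    # column giving the best univariate fit
        beta[j] += eps * u[j]            # shrunken coefficient update
        r -= eps * u[j] * X[:, j]        # shrunken residual update
    return beta, r

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X, y = standardize(rng.normal(size=(50, 10)), rng.normal(size=50))
beta, r = ls_boost(X, y, eps=0.5, num_iters=200)
```

Setting `eps=1.0` gives plain LS-Boost; smaller values trade slower fitting for more gradual shrinkage.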


LS-Boost($\varepsilon$) is a slow-learning variant of LS-Boost, where to counterbalance the greedy selection strategy of the best univariate fit to the current residuals, the updates are shrunk by an additional factor of $\varepsilon$, as described in Step 3 in Algorithm LS-Boost($\varepsilon$). This additional shrinkage factor $\varepsilon$ is also known as the learning rate. Qualitatively speaking, a small value of $\varepsilon$ (for example, $\varepsilon = 0.001$) slows down the learning rate as compared to the choice $\varepsilon = 1$. As the number of iterations increases, the training error decreases until one eventually attains a least squares fit. For a small value of $\varepsilon$, the number of iterations required to reach a certain training error increases. However, with a small value of $\varepsilon$ it is possible to explore a larger class of models, with varying degrees of shrinkage. It has been observed empirically that this often leads to models with better predictive power [21]. In short, both $M$ (the number of boosting iterations) and $\varepsilon$ together control the training error and the amount of shrinkage. Up until now, as pointed out by [28], the understanding of this tradeoff has been rather qualitative. One of the contributions of this paper is a precise quantification of this tradeoff, which we do in Section 2.


The papers [7]-[9] present very interesting perspectives on LS-Boost($\varepsilon$), where they refer to the algorithm as L2-Boost. [8] also obtains approximate expressions for the effective degrees of freedom of the L2-Boost algorithm. In the non-stochastic setting, this is known as Matching Pursuit [31]. LS-Boost($\varepsilon$) is also closely related to Friedman's MART algorithm [25].


Incremental Forward Stagewise Regression - FS$_\varepsilon$ A close cousin of the LS-Boost($\varepsilon$) algorithm is the Incremental Forward Stagewise algorithm [15, 28] presented below, which we refer to as FS$_\varepsilon$.


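Algorithm: Incremental Forward Stagewise Regression - FS$_\varepsilon$
Fix the learning rate $\varepsilon > 0$ and number of iterations $M$.

Initialize at $\hat{r}^0 = y$, $\hat\beta^0 = 0$, $k = 0$.

1. For $0 \le k \le M$ do the following:

2. Compute: $j_k \in \arg\max_{j \in \{1, \ldots, p\}} |(\hat{r}^k)^T X_j|$

3. $\hat{r}^{k+1} \leftarrow \hat{r}^k - \varepsilon \, \mathrm{sgn}((\hat{r}^k)^T X_{j_k}) X_{j_k}$

$\hat\beta^{k+1}_{j_k} \leftarrow \hat\beta^k_{j_k} + \varepsilon \, \mathrm{sgn}((\hat{r}^k)^T X_{j_k})$ and $\hat\beta^{k+1}_j \leftarrow \hat\beta^k_j , \ j \ne j_k$.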

In this algorithm, at the $k$-th iteration we choose a covariate $X_{j_k}$ that is the most correlated (in absolute value) with the current residual and update the $j_k$-th regression coefficient, along with the residuals, with a shrinkage factor $\varepsilon$. As in the LS-Boost($\varepsilon$) algorithm, the choice of $\varepsilon$ plays a crucial role in the statistical behavior of the FS$_\varepsilon$ algorithm. A large choice of $\varepsilon$ usually means an aggressive strategy; a smaller value corresponds to a slower learning procedure. Both the parameter $\varepsilon$ and the number of iterations $M$ control the data fidelity and shrinkage in a fashion qualitatively similar to LS-Boost($\varepsilon$). We refer the reader to Figure 1, which depicts the evolution of the algorithmic properties of the FS$_\varepsilon$ algorithm as a function of $k$ and $\varepsilon$. In Section 3 herein, we will present for the first time precise descriptions of how the quantities $\varepsilon$ and $M$ control the amount of training error and regularization in FS$_\varepsilon$, which will consequently inform us about their tradeoffs.
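The FS$_\varepsilon$ iteration can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the name `fs_eps` and the synthetic data are assumptions made here. Because each iteration moves exactly one coefficient by $\pm\varepsilon$, the $\ell_1$ norm of the coefficients after $M$ iterations cannot exceed $\varepsilon M$, which the sketch makes easy to check.

```python
import numpy as np

def fs_eps(X, y, eps=0.01, num_iters=1000):
    """Incremental Forward Stagewise (FS_eps) sketch for standardized X, y."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for _ in range(num_iters):
        corr = X.T @ r                      # correlations with the residual
        j = int(np.argmax(np.abs(corr)))    # most correlated covariate
        s = np.sign(corr[j])                # s_k in {-1, 0, 1}
        beta[j] += eps * s                  # fixed-size coefficient step
        r -= eps * s * X[:, j]              # matching residual update
    return beta, r

# Hypothetical synthetic data, standardized as assumed in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=50)
y -= y.mean()

eps, M = 0.01, 500
beta, r = fs_eps(X, y, eps=eps, num_iters=M)
```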

Note that LS-Boost($\varepsilon$) and FS$_\varepsilon$ have a lot of similarities but contain subtle differences too, as we will characterize in this paper. Firstly, since all of the covariates are standardized to have unit $\ell_2$ norm, for the same given residual value $\hat{r}^k$ it is simple to derive that Step (2.) of LS-Boost($\varepsilon$) and FS$_\varepsilon$ lead to the same choice of $j_k$. However, they are not the same algorithm and their differences are rather plain to see from their residual updates, i.e., Step (3.). In particular, the amount of change in the successive residuals differs across the algorithms:

$$\begin{aligned} \text{LS-Boost}(\varepsilon): \quad & \|\hat{r}^{k+1} - \hat{r}^k\|_2 = \varepsilon |(\hat{r}^k)^T X_{j_k}| = \varepsilon \cdot n \cdot \|\nabla L_n(\hat\beta^k)\|_\infty \\ \text{FS}_\varepsilon: \quad & \|\hat{r}^{k+1} - \hat{r}^k\|_2 = \varepsilon |s_k| \quad \text{where } s_k = \mathrm{sgn}((\hat{r}^k)^T X_{j_k}) , \end{aligned} \qquad (1)$$

where $\nabla L_n(\cdot)$ is the gradient of the least squares loss function $L_n(\beta) := \frac{1}{2n} \|y - X\beta\|_2^2$. Note that for both of the algorithms, the quantity $\|\hat{r}^{k+1} - \hat{r}^k\|_2$ involves the shrinkage factor $\varepsilon$. Their difference thus lies in the multiplicative factor, which is $n \cdot \|\nabla L_n(\hat\beta^k)\|_\infty$ for LS-Boost($\varepsilon$) and is $|\mathrm{sgn}((\hat{r}^k)^T X_{j_k})|$ for FS$_\varepsilon$. The norm of the successive residual differences for LS-Boost($\varepsilon$) is proportional to the $\ell_\infty$ norm of the gradient of the least squares loss function (see herein equations (5) and (7)). For FS$_\varepsilon$, the norm of the successive residual differences depends on the absolute value of the sign of the $j_k$-th coordinate of the gradient. Note that $s_k \in \{-1, 0, 1\}$ depending upon whether $(\hat{r}^k)^T X_{j_k}$ is negative, zero, or positive; and $s_k = 0$ only when $(\hat{r}^k)^T X_{j_k} = 0$, i.e., only when $\|\nabla L_n(\hat\beta^k)\|_\infty = 0$ and hence $\hat\beta^k$ is a least squares solution. Thus, for FS$_\varepsilon$ the $\ell_2$ norm of the difference in residuals is almost always $\varepsilon$ during the course of the algorithm. For the LS-Boost($\varepsilon$) algorithm, progress is considerably more sensitive to the norm of the gradient: as the algorithm makes its way to the unregularized least squares fit, one should expect the norm of the gradient to also shrink to zero, and indeed we will prove this in precise terms in Section 2. Qualitatively speaking, this means that the updates of LS-Boost($\varepsilon$) are more well-behaved when compared to the updates of FS$_\varepsilon$, which are more erratically behaved. Of course, the additional shrinkage factor $\varepsilon$ further dampens the progress for both algorithms.
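The two step lengths compared above can be checked numerically on a single iteration. The snippet below is an illustrative sketch on hypothetical synthetic data (all variable names are assumptions made here): starting from the same residual, it takes one LS-Boost($\varepsilon$) step and one FS$_\varepsilon$ step and measures the $\ell_2$ norm of each residual change.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 40, 8, 0.1

# Standardize as assumed in the text: centered, unit l2-norm columns.
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=n)
y -= y.mean()

r = y.copy()                                  # residual at beta = 0
corr = X.T @ r
j = int(np.argmax(np.abs(corr)))              # same j_k for both algorithms

step_ls = eps * corr[j] * X[:, j]             # LS-Boost(eps) residual change
step_fs = eps * np.sign(corr[j]) * X[:, j]    # FS_eps residual change

grad_inf = np.max(np.abs(-(X.T @ r) / n))     # ||grad L_n(beta)||_inf
ls_len = np.linalg.norm(step_ls)
fs_len = np.linalg.norm(step_fs)
```

Here `ls_len` scales with the current gradient norm, while `fs_len` equals `eps` regardless of how close the iterate is to a least squares solution.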

Our results in Section 2 show that the predicted values $X\hat\beta^k$ obtained from LS-Boost($\varepsilon$) converge (at a globally linear rate) to the least squares fit as $k \to \infty$, this holding true for any value of $\varepsilon \in (0, 1]$. On the other hand, for FS$_\varepsilon$ with $\varepsilon > 0$, the iterates $X\hat\beta^k$ need not necessarily converge to the least squares fit as $k \to \infty$. Indeed, the FS$_\varepsilon$ algorithm, by its operational definition, has a uniform learning rate $\varepsilon$ which remains fixed for all iterations; this makes it impossible to always guarantee convergence to a least squares solution with accuracy less than $O(\varepsilon)$. While the predicted values of LS-Boost($\varepsilon$) converge to a least squares solution at a linear rate, we show in Section 3 that the predictions from the FS$_\varepsilon$ algorithm converge to an approximate least squares solution, albeit at a global sublinear rate.$^2$

$^2$ For the purposes of this paper, linear convergence of a sequence $\{a_i\}$ will mean that $a_i \to \bar{a}$ and there exists a scalar $\gamma < 1$ for which $(a_i - \bar{a})/(a_{i-1} - \bar{a}) \le \gamma$ for all $i$. Sublinear convergence will mean that there is no such $\gamma < 1$ that satisfies the above property. For much more general versions of linear and sublinear convergence, see [3] for example.

Since the main difference between FS$_\varepsilon$ and LS-Boost($\varepsilon$) lies in the choice of the step-size used to update the coefficients, let us therefore consider a non-constant step-size/non-uniform learning rate version of FS$_\varepsilon$, which we call FS$_{\varepsilon_k}$. FS$_{\varepsilon_k}$ replaces Step 3 of FS$_\varepsilon$ by:

residual update: $\hat{r}^{k+1} \leftarrow \hat{r}^k - \varepsilon_k \, \mathrm{sgn}((\hat{r}^k)^T X_{j_k}) X_{j_k}$

coefficient update: $\hat\beta^{k+1}_{j_k} \leftarrow \hat\beta^k_{j_k} + \varepsilon_k \, \mathrm{sgn}((\hat{r}^k)^T X_{j_k})$ and $\hat\beta^{k+1}_j \leftarrow \hat\beta^k_j , \ j \ne j_k$,

where $\{\varepsilon_k\}$ is a sequence of learning rates (or step-sizes) which depend upon the iteration index $k$. LS-Boost($\varepsilon$) can thus be thought of as a version of FS$_{\varepsilon_k}$, where the step-size $\varepsilon_k$ is given by $\varepsilon_k := \varepsilon \tilde{u}_{j_k} \mathrm{sgn}((\hat{r}^k)^T X_{j_k})$.
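The relationship between FS$_{\varepsilon_k}$ and LS-Boost($\varepsilon$) can be illustrated with a generic variable-step routine. In the sketch below, `fs_variable_step` and the `step_rule` callback are hypothetical names introduced for illustration. With unit-norm columns, $\tilde{u}_{j_k} = (\hat{r}^k)^T X_{j_k}$, so the step-size $\varepsilon_k = \varepsilon \tilde{u}_{j_k} \mathrm{sgn}((\hat{r}^k)^T X_{j_k})$ equals $\varepsilon |(\hat{r}^k)^T X_{j_k}|$; plugging in that rule should reproduce a direct LS-Boost($\varepsilon$) loop.

```python
import numpy as np

def fs_variable_step(X, y, step_rule, num_iters):
    """FS with iteration-dependent step eps_k = step_rule(k, corr_jk)."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for k in range(num_iters):
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))
        s = np.sign(corr[j])
        eps_k = step_rule(k, corr[j])
        beta[j] += eps_k * s
        r -= eps_k * s * X[:, j]
    return beta, r

def ls_boost_direct(X, y, eps, num_iters):
    """Direct LS-Boost(eps) loop, used only as a reference for comparison."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for _ in range(num_iters):
        u = X.T @ r
        j = int(np.argmax(np.abs(u)))
        beta[j] += eps * u[j]
        r -= eps * u[j] * X[:, j]
    return beta, r

# Hypothetical standardized synthetic data.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=40)
y -= y.mean()

eps = 0.3
beta_fs, _ = fs_variable_step(X, y, lambda k, c: eps * abs(c), 100)  # LS-Boost rule
beta_ls, _ = ls_boost_direct(X, y, eps, 100)
beta_const, _ = fs_variable_step(X, y, lambda k, c: eps, 100)        # plain FS_eps
```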


In Section 3.2 we provide a unified treatment of LS-Boost($\varepsilon$), FS$_\varepsilon$, and FS$_{\varepsilon_k}$, wherein we show that all these methods can be viewed as special instances of (convex) subgradient optimization. For another perspective on the similarities and differences between FS$_\varepsilon$ and LS-Boost($\varepsilon$), see [8].


[Figure 1 panels: columns for $\rho = 0$, $\rho = 0.5$, and $\rho = 0.9$; x-axis: $\log_{10}$(Number of Boosting Iterations).]

Figure 1: Evolution of LS-Boost($\varepsilon$) and FS$_\varepsilon$ versus iterations (in the log-scale), run on a synthetic dataset with $n = 50$, $p = 500$; the covariates are drawn from a Gaussian distribution with pairwise correlations $\rho$. The true $\beta$ has ten non-zeros with $\beta_i = 1$, $i \le 10$ and SNR = 1. Several different values of $\rho$ and $\varepsilon$ have been considered. [Top Row] Shows the training errors for different learning rates, [Bottom Row] shows the $\ell_1$ norm of the coefficients produced by the different algorithms for different learning rates (here the values have all been re-scaled so that the y-axis lies in $[0, 1]$). For detailed discussions about the figure, see the main text.


Both LS-Boost($\varepsilon$) and FS$_\varepsilon$ may be interpreted as “cautious” versions of Forward Selection or Forward Stepwise regression [33, 44], a classical variable selection tool used widely in applied statistical modeling. Forward Stepwise regression builds a model sequentially by adding one variable at a time. At every stage, the algorithm identifies the variable most correlated (in absolute value) with the current residual, includes it in the model, and updates the joint least squares fit based on the current set of predictors. This aggressive update procedure, where all of the coefficients in the active set are simultaneously updated, is what makes stepwise regression quite different from FS$_\varepsilon$ and LS-Boost($\varepsilon$): in the latter algorithms only one variable is updated (with an additional shrinkage factor) at every iteration.


Explicit Regularization Schemes While all the methods described above are known to deliver regularized models, the nature of the regularization imparted by the algorithms is rather implicit. To highlight the difference between an implicit and explicit regularization scheme, consider $\ell_1$-regularized regression, namely Lasso [41], which is an extremely popular method especially for high-dimensional linear regression, i.e., when the number of parameters far exceeds the number of samples. The Lasso performs both variable selection and shrinkage in the regression coefficients, thereby leading to parsimonious models with good predictive performance. The constraint version of Lasso with regularization parameter $\delta \ge 0$ is given by the following convex quadratic optimization problem:

$$\text{Lasso}: \quad L^*_{n,\delta} := \min_\beta \ \frac{1}{2n} \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le \delta . \qquad (2)$$


The nature of regularization via the Lasso is explicit: by its very formulation, it is set up to find the best least squares solution subject to a constraint on the $\ell_1$ norm of the regression coefficients. This is in contrast to boosting algorithms like FS$_\varepsilon$ and LS-Boost($\varepsilon$), wherein regularization is imparted implicitly as a consequence of the structural properties of the algorithm, with $\varepsilon$ and $M$ controlling the amount of shrinkage.


Boosting and Lasso Although Lasso and the above boosting methods originate from different perspectives, there are interesting similarities between the two, as explored in [15, 27, 28]. For certain datasets the coefficient profiles$^3$ of Lasso and FS$_0$ are exactly the same [28], where FS$_0$ denotes the limiting case of the FS$_\varepsilon$ algorithm as $\varepsilon \to 0+$. Figure 2 (top panel) shows an example where the Lasso profile is similar to those of FS$_\varepsilon$ and LS-Boost($\varepsilon$) (for small values of $\varepsilon$). However, they are different in general (Figure 2, bottom panel). Under some conditions on the monotonicity of the coefficient profiles of the Lasso solution, the Lasso and FS$_0$ profiles are exactly the same [15, 27]. Such equivalences exist for more general loss functions [37], albeit under fairly strong assumptions on problem data.


Efforts to understand boosting algorithms in general and in particular the FS$_\varepsilon$ algorithm paved the way for the celebrated Least Angle Regression, also known as the Lar algorithm [15] (see also [28]). The Lar algorithm is a democratic version of Forward Stepwise. Upon identifying the variable most correlated with the current residual in absolute value (as in Forward Stepwise), it moves the coefficient of the variable towards its least squares value in a continuous fashion. An appealing aspect of the Lar algorithm is that it provides a unified algorithmic framework for variable selection and shrinkage: one instance of Lar leads to a path algorithm for the Lasso, and a different instance leads to the limiting case of the FS$_\varepsilon$ algorithm as $\varepsilon \to 0+$, namely FS$_0$. In fact, the Stagewise version of the Lar algorithm provides an efficient way to compute the coefficient profile for FS$_0$.

$^3$ By a coefficient profile we mean the map $\lambda \mapsto \hat\beta_\lambda$ where $\lambda \in \Lambda$ indexes a family of coefficients $\hat\beta_\lambda$. For example, the family of Lasso solutions (2) $\{\hat\beta_\delta, \delta \ge 0\}$ indexed by $\delta$ can also be indexed by the $\ell_1$ norm of the coefficients, i.e., $\lambda = \|\hat\beta_\delta\|_1$. This leads to a coefficient profile that depends upon the $\ell_1$ norm of the regression coefficients. Similarly, one may consider the coefficient profile of FS$_0$ as a function of the $\ell_1$ norm of the regression coefficients delivered by the FS$_0$ algorithm.



[Figure 2 panels: x-axis in each panel: $\ell_1$ shrinkage of coefficients.]

Figure 2: Coefficient Profiles for different algorithms as a function of the $\ell_1$ norm of the regression coefficients on two different datasets. [Top Panel] Corresponds to the full Prostate Cancer dataset described in Section 6 with $n = 98$ and $p = 8$. All the coefficient profiles look similar. [Bottom Panel] Corresponds to a subset of samples of the Prostate Cancer dataset with $n = 10$; we also included all second order interactions to get $p = 44$. The coefficient profile of Lasso is seen to be different from FS$_\varepsilon$ and LS-Boost($\varepsilon$). Figure 9 shows the training error vis-a-vis the $\ell_1$-shrinkage of the models, for the same profiles.


Due to the close similarities between the Lasso and boosting coefficient profiles, it is natural to investigate possible modifications of boosting that might lead to the Lasso solution path. This is one of the topics we study in this paper. In a closely related but different line of approach, [45] describes BLasso, a modification of the FS$_\varepsilon$ algorithm with the inclusion of additional “backward steps” so that the resultant coefficient profile mimics the Lasso path.


Subgradient Optimization as a Unifying Viewpoint of Boosting and Lasso In spite of the various perspectives on FS$_\varepsilon$ and its connections to the Lasso as described above, the present understanding about the relationships between Lasso, FS$_\varepsilon$, and LS-Boost($\varepsilon$) for arbitrary datasets and $\varepsilon > 0$ is still fairly limited. One of the aims of this paper is to contribute some substantial further understanding of the relationship between these methods. Just like the Lar algorithm can be viewed as a master algorithm with special instances being the Lasso and FS$_0$, in this paper we establish that FS$_\varepsilon$, LS-Boost($\varepsilon$) and Lasso can be viewed as special instances of one grand algorithm: the subgradient descent method (of convex optimization) applied to the following parametric class of optimization problems:

$$P_\delta: \quad \underset{r}{\text{minimize}} \ \ \|X^T r\|_\infty + \frac{1}{2\delta} \|r - y\|_2^2 \quad \text{where } r = y - X\beta \text{ for some } \beta , \qquad (3)$$

and where $\delta \in (0, \infty]$ is a regularization parameter. Here the first term is the maximum absolute correlation between the features $X_i$ and the residuals $r$, and the second term is a regularization term that penalizes residuals that are far from the observations $y$ (which itself can be interpreted as the residuals for the null model $\beta = 0$). The parameter $\delta$ determines the relative importance assigned to the regularization term, with $\delta = +\infty$ corresponding to no importance whatsoever. As we describe in Section 4, Problem (3) is in fact a dual of the Lasso Problem (2).

The subgradient descent algorithm applied to Problem (3) leads to a new boosting algorithm that is almost identical to FS$_\varepsilon$. We denote this algorithm by R-FS$_{\varepsilon,\delta}$ (for Regularized incremental Forward Stagewise regression). We show the following properties of the new algorithm R-FS$_{\varepsilon,\delta}$:

• R-FS$_{\varepsilon,\delta}$ is almost identical to FS$_\varepsilon$, except that it first shrinks all of the coefficients of $\hat\beta^k$ by a scaling factor $1 - \frac{\varepsilon}{\delta} < 1$ and then updates the selected coefficient $j_k$ in the same additive fashion as FS$_\varepsilon$.

• as the number of iterations becomes large, R-FS$_{\varepsilon,\delta}$ delivers an approximate Lasso solution.

• an adaptive version of R-FS$_{\varepsilon,\delta}$, which we call Path-R-FS$_\varepsilon$, is shown to approximate the path of Lasso solutions with precise bounds that quantify the approximation error over the path.

• R-FS$_{\varepsilon,\delta}$ specializes to FS$_\varepsilon$, LS-Boost($\varepsilon$) and the Lasso depending on the parameter value $\delta$ and the learning rates (step-sizes) used therein.

• the computational guarantees derived herein for R-FS$_{\varepsilon,\delta}$ provide a precise description of the evolution of data-fidelity vis-a-vis $\ell_1$ shrinkage of the models obtained along the boosting iterations.

• in our experiments, we observe that R-FS$_{\varepsilon,\delta}$ leads to models with statistical properties that compare favorably with the Lasso and FS$_\varepsilon$. It also leads to models that are sparser than FS$_\varepsilon$.

We emphasize that all of these results apply to the finite sample setup with no assumptions about 
the dataset nor about the relative sizes of p and n. 
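The shrink-then-update idea behind the regularized variant can be sketched as below. This is only an illustration under stated assumptions: the update rule is inferred from the verbal description above (shrink every coefficient, then take an FS-style step on the selected one), the scaling factor is taken to be $1 - \varepsilon/\delta$, and the name `r_fs` and the synthetic data are hypothetical. The formal algorithm is given later in the paper and may differ in detail. Under this rule, $\|\hat\beta^{k+1}\|_1 \le (1 - \varepsilon/\delta)\|\hat\beta^k\|_1 + \varepsilon$, so starting from zero the $\ell_1$ norm never exceeds $\delta$ when $\varepsilon \le \delta$.

```python
import numpy as np

def r_fs(X, y, eps, delta, num_iters):
    """Sketch of a regularized FS update; shrink factor (1 - eps/delta) is assumed."""
    beta = np.zeros(X.shape[1])
    l1_path = []
    for _ in range(num_iters):
        r = y - X @ beta
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))     # most correlated covariate
        beta *= 1.0 - eps / delta            # shrink all coefficients first
        beta[j] += eps * np.sign(corr[j])    # then the usual FS-style step
        l1_path.append(np.abs(beta).sum())
    return beta, np.array(l1_path)

# Hypothetical standardized synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 10))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=50)
y -= y.mean()

delta = 0.5
beta, l1_path = r_fs(X, y, eps=0.01, delta=delta, num_iters=2000)
```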


Contributions A summary of the contributions of this paper is as follows:

1. We analyze several boosting algorithms popularly used in the context of linear regression via the lens of first-order methods in convex optimization. We show that existing boosting algorithms, namely FS$_\varepsilon$ and LS-Boost($\varepsilon$), can be viewed as instances of the subgradient descent method aimed at minimizing the maximum absolute correlation between the covariates and residuals, namely $\|X^T r\|_\infty$. This viewpoint provides several insights about the operational characteristics of these boosting algorithms.

2. We derive novel computational guarantees for FS$_\varepsilon$ and LS-Boost($\varepsilon$). These results quantify the rate at which the estimates produced by a boosting algorithm make their way towards an unregularized least squares fit (as a function of the number of iterations and the learning rate $\varepsilon$). In particular, we demonstrate that for any value of $\varepsilon \in (0, 1]$ the estimates produced by LS-Boost($\varepsilon$) converge linearly to their respective least squares values and the $\ell_1$ norm of the coefficients grows at a rate $O(\sqrt{\varepsilon k})$. FS$_\varepsilon$ on the other hand demonstrates a slower sublinear convergence rate to an $O(\varepsilon)$-approximate least squares solution, while the $\ell_1$ norm of the coefficients grows at a rate $O(\varepsilon k)$.

3. Our computational guarantees yield precise characterizations of the amount of data-fidelity (training error) and regularization imparted by running a boosting algorithm for $k$ iterations. These results apply to any dataset and do not rely upon any distributional or structural assumptions on the data generating mechanism.

4. We show that subgradient descent applied to a regularized version of the loss function $\|X^T r\|_\infty$, with regularization parameter $\delta$, leads to a new algorithm which we call R-FS$_{\varepsilon,\delta}$, that is a natural and simple generalization of FS$_\varepsilon$. When compared to FS$_\varepsilon$, the algorithm R-FS$_{\varepsilon,\delta}$ performs a seemingly minor rescaling of the coefficients at every iteration. As the number of iterations $k$ increases, R-FS$_{\varepsilon,\delta}$ delivers an approximate Lasso solution (2). Moreover, as the algorithm progresses, the $\ell_1$ norms of the coefficients evolve as a geometric series towards the regularization parameter value $\delta$. We derive precise computational guarantees that inform us about the training error and regularization imparted by R-FS$_{\varepsilon,\delta}$.

5. We present an adaptive extension of R-FS$_{\varepsilon,\delta}$, called Path-R-FS$_\varepsilon$, that delivers a path of approximate Lasso solutions for any prescribed grid sequence of regularization parameters. We derive guarantees that quantify the average distance from the approximate path traced by Path-R-FS$_\varepsilon$ to the Lasso solution path.


Organization of the Paper The paper is organized as follows. In Section 2 we analyze the convergence behavior of the LS-Boost($\varepsilon$) algorithm. In Section 3 we present a unifying algorithmic framework for FS$_\varepsilon$, FS$_{\varepsilon_k}$, and LS-Boost($\varepsilon$) as subgradient descent. In Section 4 we present the regularized correlation minimization Problem (3) and a naturally associated boosting algorithm R-FS$_{\varepsilon,\delta}$, as instantiations of subgradient descent on the family of Problems (3). In each of the above cases, we present precise computational guarantees of the algorithms for convergence of residuals, training errors, and shrinkage, and study their statistical implications. In Section 5 we further expand R-FS$_{\varepsilon,\delta}$ into a method for computing approximate solutions of the Lasso path. Section 6 contains computational experiments. To improve readability, most of the technical details have been placed in Appendix A.

Notation 

For a vector $x \in \mathbb{R}^m$, we use $x_i$ to denote the $i$-th coordinate of $x$. We use superscripts to index vectors in a sequence $\{x^k\}$. Let $e_j$ denote the $j$-th unit vector in $\mathbb{R}^m$, and let $e = (1, \ldots, 1)$ denote the vector of ones. Let $\|\cdot\|_q$ denote the $\ell_q$ norm for $q \in [1, \infty]$ with unit ball $B_q$, and let $\|v\|_0$ denote the number of non-zero coefficients of the vector $v$. For $A \in \mathbb{R}^{m \times n}$, let $\|A\|_{q_1, q_2} := \max_{x : \|x\|_{q_1} \le 1} \|Ax\|_{q_2}$ be the operator norm. In particular, $\|A\|_{1,2} = \max(\|A_1\|_2, \ldots, \|A_n\|_2)$ is the maximum $\ell_2$ norm of the columns of $A$. For a scalar $\alpha$, $\mathrm{sgn}(\alpha)$ denotes the sign of $\alpha$. The notation “$\tilde{v} \leftarrow \arg\max_{v \in S} \{f(v)\}$” denotes assigning $\tilde{v}$ to be any optimal solution of the problem $\max_{v \in S} \{f(v)\}$. For a convex set $P$ let $\Pi_P(\cdot)$ denote the Euclidean projection operator onto $P$, namely $\Pi_P(\bar{x}) := \arg\min_{x \in P} \|x - \bar{x}\|_2$. Let $\partial f(\cdot)$ denote the subdifferential operator of a convex function $f(\cdot)$. If $Q \ne 0$ is a symmetric positive semidefinite matrix, let $\lambda_{\max}(Q)$, $\lambda_{\min}(Q)$, and $\lambda_{\mathrm{pmin}}(Q)$ denote the largest, smallest, and smallest nonzero (and hence positive) eigenvalues of $Q$, respectively.
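Two of the less common quantities in this notation, the operator norm $\|A\|_{1,2}$ and the smallest nonzero eigenvalue $\lambda_{\mathrm{pmin}}(Q)$, can be computed as in the following sketch. The variable names, the random test matrices, and the relative tolerance used to decide which eigenvalues count as nonzero are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# ||A||_{1,2}: maximum l2 norm of a column of A.
A = rng.normal(size=(6, 4))
op_norm_12 = np.linalg.norm(A, axis=0).max()

# Sanity check by sampling points on the boundary of the l1 unit ball.
xs = rng.normal(size=(4, 1000))
xs /= np.abs(xs).sum(axis=0)                  # each column has l1 norm 1
sampled_max = np.linalg.norm(A @ xs, axis=0).max()

# lambda_pmin(Q) for a rank-deficient PSD matrix Q = B^T B (5 x 5, rank 3).
B = rng.normal(size=(3, 5))
Q = B.T @ B
eigs = np.linalg.eigvalsh(Q)                  # ascending eigenvalues
lam_max = eigs.max()
lam_pmin = eigs[eigs > 1e-10 * lam_max].min() # smallest nonzero eigenvalue
```

For `Q = B.T @ B`, the nonzero eigenvalues are the squared singular values of `B`, which gives an independent way to confirm `lam_pmin`.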


2 LS-Boost($\varepsilon$): Computational Guarantees and Statistical Implications

Roadmap We begin our formal study by examining the LS-Boost($\varepsilon$) algorithm. We study the rate at which the coefficients generated by LS-Boost($\varepsilon$) converge to the set of unregularized least squares solutions. This characterizes the amount of data-fidelity as a function of the number of iterations and $\varepsilon$. In particular, we show (global) linear convergence of the regression coefficients to the set of least squares coefficients, with similar convergence rates derived for the prediction estimates and the boosting training errors delivered by LS-Boost($\varepsilon$). We also present bounds on the shrinkage of the regression coefficients $\hat\beta^k$ as a function of $k$ and $\varepsilon$, thereby describing how the amount of shrinkage of the regression coefficients changes as a function of the number of iterations $k$.

2.1 Computational Guarantees and Intuition 

We first review some useful properties associated with the familiar least squares regression problem:

$$\text{LS}: \quad L^*_n := \min_\beta \ L_n(\beta) := \frac{1}{2n} \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \beta \in \mathbb{R}^p , \qquad (4)$$

where $L_n(\cdot)$ is the least squares loss, whose gradient is:

$$\nabla L_n(\beta) = -\frac{1}{n} X^T (y - X\beta) = -\frac{1}{n} X^T r \qquad (5)$$

where $r = y - X\beta$ is the vector of residuals corresponding to the regression coefficients $\beta$. It follows that $\beta$ is a least-squares solution of LS if and only if $\nabla L_n(\beta) = 0$, which leads to the well known normal equations:

$$0 = -X^T (y - X\beta) = -X^T r . \qquad (6)$$

It also holds that:

$$n \cdot \|\nabla L_n(\beta)\|_\infty = \|X^T r\|_\infty = \max_{j \in \{1, \ldots, p\}} \{|r^T X_j|\} . \qquad (7)$$
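The gradient formula, the normal equations, and the correlation identity above can be checked numerically. The sketch below uses hypothetical synthetic data and variable names chosen for illustration; it compares the closed-form gradient with a central finite-difference approximation and verifies that $X^T r$ vanishes at a least squares solution computed with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def loss(b):
    """Least squares loss L_n(b) = ||y - X b||_2^2 / (2n)."""
    return np.sum((y - X @ b) ** 2) / (2 * n)

r = y - X @ beta
grad = -(X.T @ r) / n                          # closed-form gradient

# Central finite differences, one coordinate at a time.
h = 1e-6
num_grad = np.array([
    (loss(beta + h * np.eye(p)[i]) - loss(beta - h * np.eye(p)[i])) / (2 * h)
    for i in range(p)
])

# n * ||grad||_inf equals the maximum absolute correlation max_j |r^T X_j|.
max_abs_corr = max(abs(r @ X[:, j]) for j in range(p))

# Normal equations: X^T r = 0 at a least squares solution.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
normal_eq_residual = X.T @ (y - X @ beta_ls)
```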

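The following theorem describes precise computational guarantees for LS-Boost($\varepsilon$): linear convergence of LS-Boost($\varepsilon$) with respect to (4), and bounds on the $\ell_1$ shrinkage of the coefficients produced. Note that the theorem uses the quantity $\lambda_{\mathrm{pmin}}(X^T X)$ which denotes the smallest nonzero (and hence positive) eigenvalue of $X^T X$.

Theorem 2.1. (Linear Convergence of LS-Boost($\varepsilon$) for Least Squares) Consider the LS-Boost($\varepsilon$) algorithm with learning rate $\varepsilon \in (0, 1]$, and define the linear convergence rate coefficient $\gamma$:

$$\gamma := \left( 1 - \frac{\varepsilon (2 - \varepsilon) \lambda_{\mathrm{pmin}}(X^T X)}{4p} \right) < 1 . \qquad (8)$$

For all $k \ge 0$ the following bounds hold:

(i) (training error): $L_n(\hat\beta^k) - L^*_n \le \frac{1}{2n} \|X\hat\beta_{LS}\|_2^2 \cdot \gamma^k$

(ii) (regression coefficients): there exists a least squares solution $\hat\beta^k_{LS}$ such that:

$$\|\hat\beta^k - \hat\beta^k_{LS}\|_2 \le \frac{\|X\hat\beta_{LS}\|_2}{\sqrt{\lambda_{\mathrm{pmin}}(X^T X)}} \cdot \gamma^{k/2}$$

(iii) (predictions): for every least-squares solution $\hat\beta_{LS}$ it holds that

$$\|X\hat\beta^k - X\hat\beta_{LS}\|_2 \le \|X\hat\beta_{LS}\|_2 \cdot \gamma^{k/2}$$

(iv) (gradient norm/correlation values): $\|\nabla L_n(\hat\beta^k)\|_\infty = \frac{1}{n} \|X^T \hat{r}^k\|_\infty \le \frac{1}{n} \|X\hat\beta_{LS}\|_2 \cdot \gamma^{k/2}$

(v) ($\ell_1$-shrinkage of coefficients):

$$\|\hat\beta^k\|_1 \le \min \left\{ \sqrt{k} \sqrt{\frac{\varepsilon}{2 - \varepsilon}} \sqrt{\|X\hat\beta_{LS}\|_2^2 - \|X\hat\beta_{LS} - X\hat\beta^k\|_2^2} , \ \frac{\varepsilon \|X\hat\beta_{LS}\|_2}{1 - \sqrt{\gamma}} \left( 1 - \gamma^{k/2} \right) \right\}$$

(vi) (sparsity of coefficients): $\|\hat\beta^k\|_0 \le k$.

□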


Before remarking on the various parts of Theorem 2.1 we first discuss the quantity $\gamma$ defined in (8), which is called the linear convergence rate coefficient. We can write $\gamma = 1 - \frac{\varepsilon (2 - \varepsilon)}{4 \kappa(X^T X)}$ where $\kappa(X^T X)$ is defined to be the ratio $\kappa(X^T X) := \frac{p}{\lambda_{\mathrm{pmin}}(X^T X)}$. Note that $\kappa(X^T X) \in [1, \infty)$. To see this, let $\tilde\beta$ be an eigenvector associated with the largest eigenvalue of $X^T X$, then:

$$0 < \lambda_{\mathrm{pmin}}(X^T X) \le \lambda_{\max}(X^T X) = \frac{\|X\tilde\beta\|_2^2}{\|\tilde\beta\|_2^2} \le \frac{\|X\|_{1,2}^2 \|\tilde\beta\|_1^2}{\|\tilde\beta\|_2^2} \le p , \qquad (9)$$

where the last inequality uses our assumption that the columns of $X$ have been normalized (whereby $\|X\|_{1,2} = 1$), and the fact that $\|\tilde\beta\|_1 \le \sqrt{p} \|\tilde\beta\|_2$. This then implies that $\gamma \in [0.75, 1.0)$, independent of any assumption on the dataset, and most importantly it holds that $\gamma < 1$.
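The quantities $\kappa(X^T X)$ and $\gamma$ depend only on the model matrix and the learning rate, so they can be computed before running any boosting iterations. The following sketch does so for a hypothetical synthetic design with $p > n$ (so that $X^T X$ is rank-deficient); the function names and the eigenvalue tolerance are assumptions made here.

```python
import numpy as np

def lambda_pmin(Q, rel_tol=1e-10):
    """Smallest nonzero eigenvalue of a symmetric PSD matrix Q."""
    eigs = np.linalg.eigvalsh(Q)
    return eigs[eigs > rel_tol * eigs.max()].min()

def rate_coefficient(X, eps):
    """gamma = 1 - eps (2 - eps) lambda_pmin(X^T X) / (4 p), as in (8)."""
    p = X.shape[1]
    return 1.0 - eps * (2.0 - eps) * lambda_pmin(X.T @ X) / (4.0 * p)

# Hypothetical design with more features than samples, standardized columns.
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 200))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)

kappa = X.shape[1] / lambda_pmin(X.T @ X)
gammas = {eps: rate_coefficient(X, eps) for eps in (0.01, 0.1, 0.5, 1.0)}
```

Smaller learning rates push `gammas[eps]` closer to 1, which matches the slower guaranteed convergence described in the text.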


Let us now make the following immediate remarks on Theorem 2.1:


• The bounds in parts (i)-(iv) state that the training errors, regression coefficients, predictions, and correlation values produced by LS-Boost($\varepsilon$) converge linearly (also known as geometric or exponential convergence) to their least squares counterparts: they decrease by at least the constant multiplicative factor $\gamma < 1$ for part (i), and by $\sqrt{\gamma}$ for parts (ii)-(iv), at every iteration. The bounds go to zero at this linear rate as $k \to \infty$.

• The computational guarantees in parts (i)-(vi) provide characterizations of the data-fidelity and shrinkage of the LS-Boost($\varepsilon$) algorithm for any given specifications of the learning rate $\varepsilon$ and the number of boosting iterations $k$. Moreover, the quantities appearing in the bounds can be computed from simple characteristics of the data that can be obtained a priori without even running the boosting algorithm. (And indeed, one can even substitute $\|y\|_2$ in place of $\|X\hat\beta_{\mathrm{LS}}\|_2$ throughout the bounds if desired since $\|X\hat\beta_{\mathrm{LS}}\|_2 \le \|y\|_2$.)
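Because the bound in part (i) shrinks by the factor $\gamma$ at every iteration, one can work out in advance how many iterations are needed before that bound (not necessarily the actual training error gap) falls below a given fraction of its initial value. The sketch below illustrates this; the function name and the tolerance parameter `tol` are assumptions introduced here for illustration.

```python
import math


def iterations_for_relative_gap(gamma, tol):
    """Smallest k >= 0 with gamma**k <= tol, for 0 < gamma < 1 and 0 < tol < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if not 0.0 < tol < 1.0:
        raise ValueError("tol must lie in (0, 1)")
    k = max(0, math.ceil(math.log(tol) / math.log(gamma)))
    # Guard against floating-point rounding in the logarithms.
    while gamma ** k > tol:
        k += 1
    while k > 0 and gamma ** (k - 1) <= tol:
        k -= 1
    return k


# Illustrative values: the best-case rate 0.75 and a rate close to one.
k_fast = iterations_for_relative_gap(0.75, 0.01)
k_slow = iterations_for_relative_gap(0.999, 0.01)
```

The comparison of `k_fast` and `k_slow` makes concrete how strongly the iteration count depends on how close $\gamma$ is to one.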


Some Intuition Behind Theorem 2.1. Let us now study the LS-Boost($\varepsilon$) algorithm and build intuition regarding its progress with respect to solving the unconstrained least squares problem, which will inform the results in Theorem 2.1. Since the predictors are all standardized to have unit $\ell_2$ norm, it follows that the coefficient index $j_k$ and corresponding step-size $\tilde u_{j_k}$ selected in Step (2.) of LS-Boost($\varepsilon$) satisfy:

$$j_k \in \arg\max_{j \in \{1,\ldots,p\}} |(\hat r^k)^T X_j| \quad \text{and} \quad \tilde u_{j_k} = (\hat r^k)^T X_{j_k} . \qquad (10)$$

Combining (10) with the gradient expression $\nabla L_n(\beta) = -\frac{1}{n} X^T (y - X\beta)$, we see that

$$|\tilde u_{j_k}| = |(\hat r^k)^T X_{j_k}| = n \cdot \|\nabla L_n(\hat\beta^k)\|_\infty . \qquad (11)$$

Using the formula for $\tilde u_{j_k}$ in (10), we have the following convenient way to express the change in residuals at each iteration of LS-Boost($\varepsilon$):

$$\hat r^{k+1} = \hat r^k - \varepsilon \left( (\hat r^k)^T X_{j_k} \right) X_{j_k} . \qquad (12)$$

Intuitively, since (12) expresses $\hat r^{k+1}$ as the difference of two correlated variables, $\hat r^k$ and $\mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k}$, we expect the squared $\ell_2$ norm of $\hat r^{k+1}$ (i.e. its sample variance) to be smaller than that of $\hat r^k$. On the other hand, as we see from (11), convergence of the residuals is ensured by the dependence of the change in residuals on $|(\hat r^k)^T X_{j_k}|$, which goes to 0 as we approach a least squares solution. In the proof of Theorem 2.1 in Appendix A.2.2 we make this intuition precise by using (12) to quantify the amount of decrease in the least squares objective function at each iteration of LS-Boost($\varepsilon$). The final ingredient of the proof uses properties of convex quadratic functions (Appendix A.2.1) to relate the exact amount of the decrease from iteration $k$ to $k + 1$ to the current optimality gap $L_n(\hat\beta^k) - L_n^*$, which yields the following strong linear convergence property:

$$L_n(\hat\beta^{k+1}) - L_n^* \le \gamma \cdot \left( L_n(\hat\beta^k) - L_n^* \right) . \qquad (13)$$

The above states that the training error gap decreases at each iteration by at least the multiplicative factor of $\gamma$, and clearly implies item (i) of Theorem 2.1.
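The selection rule (10) and the residual update (12) translate directly into a few lines of code. The following is a minimal sketch of LS-Boost($\varepsilon$) as described by those two equations, assuming the predictors are stored as a list of unit-norm columns; the helper names and the toy data are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(col):
    """Scale a column to unit l2 norm, matching the standardization assumed in the text."""
    norm = math.sqrt(dot(col, col))
    return [v / norm for v in col]


def ls_boost(X_cols, y, eps, num_iters):
    """LS-Boost(eps) for predictors stored as unit-norm columns.

    Returns the final coefficients, the final residual, and the squared
    residual norm recorded at iterations 0, 1, ..., num_iters.
    """
    p = len(X_cols)
    beta = [0.0] * p
    r = list(y)  # beta^0 = 0, so the initial residual is y
    sq_norms = [dot(r, r)]
    for _ in range(num_iters):
        corr = [dot(r, col) for col in X_cols]             # (r^k)^T X_j for every j
        j = max(range(p), key=lambda idx: abs(corr[idx]))  # index j_k from (10)
        u = corr[j]                                        # step-size from (10)
        beta[j] += eps * u
        r = [ri - eps * u * xi for ri, xi in zip(r, X_cols[j])]  # residual update (12)
        sq_norms.append(dot(r, r))
    return beta, r, sq_norms


# Hypothetical toy data: three predictors in R^3, normalized to unit norm.
X_cols = [normalize(c) for c in ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0])]
y = [1.0, 2.0, 3.0]
beta, r, sq_norms = ls_boost(X_cols, y, eps=0.5, num_iters=25)
```

In this sketch the recorded values in `sq_norms` never increase from one iteration to the next, in line with the intuition given above that each update reduces the squared norm of the residual.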




Comments on the global linear convergence rate in Theorem 2.1. The global linear convergence of LS-Boost($\varepsilon$) proved in Theorem 2.1, while novel, is not at odds with the present understanding of such convergence for optimization problems. One can view LS-Boost($\varepsilon$) as performing steepest descent optimization steps with respect to the $\ell_1$ norm unit ball (rather than the $\ell_2$ norm unit ball which is the canonical version of the steepest descent method, see [35]). It is known [35] that canonical steepest descent exhibits global linear convergence for convex quadratic optimization so long as the Hessian matrix $Q$ of the quadratic objective function is positive definite, i.e., $\lambda_{\min}(Q) > 0$. For the least squares loss function, $Q = \frac{1}{n} X^TX$, which yields the condition that $\lambda_{\min}(X^TX) > 0$. This result is known to extend to other norms defining steepest descent as well. Hence what is modestly surprising herein is not the linear convergence per se, but rather that LS-Boost($\varepsilon$) exhibits global linear convergence even when $\lambda_{\min}(X^TX) = 0$, i.e., even when $X$ does not have full column rank (essentially replacing $\lambda_{\min}(X^TX)$ with $\lambda_{\mathrm{pmin}}(X^TX)$ in our analysis). This derives specifically from the structure of the least squares loss function, whose function values (and whose gradient) are invariant in the null space of $X$, i.e., $L_n(\beta + d) = L_n(\beta)$ for all $d$ satisfying $Xd = 0$, and is thus rendered "immune" to changes in $\beta$ in the null space of $X^TX$.


2.2 Statistical Insights from the Computational Guarantees 


Note that in most noisy problems, the limiting least squares solution is statistically less interesting than an estimate obtained in the interior of the boosting profile, since the latter typically corresponds to a model with better bias-variance tradeoff. We thus caution the reader that the bounds in Theorem 2.1 should not be merely interpreted as statements about how rapidly the boosting iterations reach the least squares fit. We rather intend for these bounds to inform us about the evolution of the training errors and the amount of shrinkage of the coefficients as the LS-Boost($\varepsilon$) algorithm progresses and when $k$ is at most moderately large. When the training errors are paired with the profile of the $\ell_1$-shrinkage values of the regression coefficients, they lead to the ordered pairs:

$$\left( \frac{1}{2n}\|y - X\hat\beta^k\|_2^2\ ,\ \|\hat\beta^k\|_1 \right) , \quad k \ge 1 , \qquad (14)$$


which describes the data-fidelity and $\ell_1$-shrinkage tradeoff as a function of $k$, for the given learning rate $\varepsilon > 0$. This profile is described in Figure 9 in Appendix A.1.1 for several data instances. The bounds in Theorem 2.1 provide estimates for the two components of the ordered pair (14), and they can be computed prior to running the boosting algorithm. For simplicity, let us use the following crude estimate:

$$\ell_k := \min\left\{ \|X\hat\beta_{\mathrm{LS}}\|_2 \sqrt{\frac{k\varepsilon}{2-\varepsilon}}\ ,\ \ \frac{\varepsilon\|X\hat\beta_{\mathrm{LS}}\|_2}{1-\sqrt{\gamma}}\left(1-\gamma^{k/2}\right) \right\} ,$$

which is an upper bound of the bound in part (v) of the theorem, to provide an upper approximation of $\|\hat\beta^k\|_1$. Combining the above estimate with the guarantee in part (i) of Theorem 2.1 in (14), we obtain the following ordered pairs:

$$\left( \frac{1}{2n}\|X\hat\beta_{\mathrm{LS}}\|_2^2 \cdot \gamma^k + L_n^*\ ,\ \ell_k \right) , \quad k \ge 1 , \qquad (15)$$

which describe the entire profile of the training error bounds and the $\ell_1$-shrinkage bounds as a function of $k$ as suggested by Theorem 2.1. These profiles, as described above in (15), are illustrated in Figure 3.
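Since every quantity in (15) is available before running the algorithm, the bound profile can be tabulated directly. The sketch below does so under stated assumptions: `norm_fit` stands for $\|X\hat\beta_{\mathrm{LS}}\|_2$, `ls_error` for $L_n^*$, and all numerical inputs are made-up illustrative values rather than values from any dataset in the paper.

```python
import math


def ls_boost_bound_profile(eps, gamma, norm_fit, n, ls_error, num_iters):
    """Ordered pairs (training-error bound, shrinkage bound ell_k) for k = 1..num_iters."""
    pairs = []
    for k in range(1, num_iters + 1):
        train_bound = norm_fit ** 2 * gamma ** k / (2.0 * n) + ls_error
        # First term of the crude estimate: ||X beta_LS||_2 * sqrt(k * eps / (2 - eps)).
        shrink_a = norm_fit * math.sqrt(k * eps / (2.0 - eps))
        # Second term: eps * ||X beta_LS||_2 / (1 - sqrt(gamma)) * (1 - gamma^(k/2)).
        shrink_b = eps * norm_fit / (1.0 - math.sqrt(gamma)) * (1.0 - gamma ** (k / 2.0))
        pairs.append((train_bound, min(shrink_a, shrink_b)))
    return pairs


# Hypothetical inputs: eps = 0.1 and gamma = 0.99 correspond to kappa = 4.75.
profile = ls_boost_bound_profile(
    eps=0.1, gamma=0.99, norm_fit=1.0, n=50, ls_error=0.0, num_iters=200
)
```

Plotting the first coordinate of `profile` against the second would give a curve of the kind described for Figure 3, with the training error bound falling as the shrinkage bound grows.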

It is interesting to consider the profiles of Figure 3 alongside the explicit regularization framework of the Lasso (2), which also traces out a profile of the form (14):

$$\left( \frac{1}{2n}\|y - X\hat\beta^*_\delta\|_2^2\ ,\ \delta \right) , \quad \delta \ge 0 , \qquad (16)$$

as a function of $\delta$, where $\hat\beta^*_\delta$ is a solution to the Lasso problem (2). For a value of $\delta := \ell_k$, the optimal objective value of the Lasso problem will serve as a lower bound of the corresponding LS-Boost($\varepsilon$) loss function value at iteration $k$. Thus the training error of $\hat\beta^k$ delivered by the LS-Boost($\varepsilon$) algorithm will be sandwiched between the following lower and upper bounds:

$$L_{i,k} := \frac{1}{2n}\|y - X\hat\beta^*_{\ell_k}\|_2^2 \ \le\ \frac{1}{2n}\|y - X\hat\beta^k\|_2^2 \ \le\ \frac{1}{2n}\|X\hat\beta_{\mathrm{LS}}\|_2^2 \cdot \gamma^k + L_n^* =: U_{i,k}$$

for every $k$. Note that the difference between the upper and lower bounds above, given by $U_{i,k} - L_{i,k}$, converges to zero as $k \to \infty$. Figure 9 in Appendix A.1.1 shows the training error versus shrinkage profiles for LS-Boost($\varepsilon$) and Lasso for different datasets.

[Figure 3 plot not reproduced. Title: LS-Boost($\varepsilon$) algorithm: $\ell_1$-shrinkage versus data-fidelity tradeoffs (theoretical bounds). Panels: Synthetic dataset ($\kappa = 1$), Synthetic dataset ($\kappa = 25$), Leukemia dataset. Horizontal axes: $\ell_1$ shrinkage of coefficients. Legend: eps = 0.01, 0.025, 0.05, 1.]

Figure 3: Figure showing profiles of $\ell_1$ shrinkage of the regression coefficients versus training error for the LS-Boost($\varepsilon$) algorithm, for different values of the learning rate $\varepsilon$ (denoted by the moniker "eps" in the legend). The profiles have been obtained from the computational bounds in Theorem 2.1. The left and middle panels correspond to synthetic values of the ratio $\kappa = \frac{p}{\lambda_{\mathrm{pmin}}}$, and for the right panel profiles the value of $\kappa$ (here, $\kappa = 270.05$) is extracted from the Leukemia dataset, described in Section 6. The vertical axes have been normalized so that the training error at $k = 0$ is one, and the horizontal axes have been scaled to the unit interval.


For the bounds in parts (i) and (iii) of Theorem 2.1, the asymptotic limits (as $k \to \infty$) are the unregularized least squares training error and predictions, which are quantities that are uniquely defined even in the underdetermined case.


The bound in part (ii) of Theorem 2.1 is a statement concerning the regression coefficients. In this case, the notion of convergence needs to be appropriately modified from parts (i) and (iii), since the natural limiting object $\hat\beta_{\mathrm{LS}}$ is not necessarily unique. In this case, perhaps not surprisingly, the regression coefficients $\hat\beta^k$ need not converge. The result in part (ii) of the theorem states that $\hat\beta^k$ converges at a linear rate to the set of least squares solutions. In other words, at every LS-Boost($\varepsilon$) boosting iteration, there exists a least squares solution $\hat\beta^k_{\mathrm{LS}}$ for which the presented bound holds. Here $\hat\beta^k_{\mathrm{LS}}$ is in fact the closest least squares solution to $\hat\beta^k$ in the $\ell_2$ norm, and the particular candidate least squares solution $\hat\beta^k_{\mathrm{LS}}$ may be different for each iteration.

Figure 4: Figure showing the behavior of $\gamma$ [left panel] and $\lambda_{\mathrm{pmin}}(X^TX)$ [right panel] for different values of $\rho$ (denoted by the moniker "rho" in the legend) and $p$, with $\varepsilon = 1$. There are ten profiles in each panel corresponding to different values of $\rho$ for $\rho = 0, 0.1, \ldots, 0.9$. Each profile documents the change in $\gamma$ as a function of $p$. Here, the data matrix $X$ is comprised of $n = 50$ samples from a $p$-dimensional multivariate Gaussian distribution with mean zero, and all pairwise correlations equal to $\rho$, and the features are then standardized to have unit $\ell_2$ norm. The left panel shows that $\gamma$ exhibits a phase of rapid decay (as a function of $p$) after which it stabilizes into the regime of fastest convergence. Interestingly, the behavior shows a monotone trend in $\rho$: the rate of progress of LS-Boost($\varepsilon$) becomes slower for larger values of $\rho$ and faster for smaller values of $\rho$.


Interpreting the parameters and algorithm dynamics. There are several determinants of the quality of the bounds in the different parts of Theorem 2.1 which can be grouped into:


• algorithmic parameters: this includes the learning rate $\varepsilon$ and the number of iterations $k$, and

• data dependent quantities: $\|X\hat\beta_{\mathrm{LS}}\|_2$, $\lambda_{\mathrm{pmin}}(X^TX)$, and $p$.

The coefficient of linear convergence is given by the quantity $\gamma := 1 - \frac{\varepsilon(2-\varepsilon)}{4\kappa(X^TX)}$, where $\kappa(X^TX) := \frac{p}{\lambda_{\mathrm{pmin}}(X^TX)}$. Note that $\gamma$ is monotone decreasing in $\varepsilon$ for $\varepsilon \in (0, 1]$, and is minimized at $\varepsilon = 1$. This simple observation confirms the general intuition about LS-Boost($\varepsilon$): $\varepsilon = 1$ corresponds to the most aggressive model fitting behavior in the LS-Boost($\varepsilon$) family, with smaller values of $\varepsilon$ corresponding to a slower model fitting process. The ratio $\kappa(X^TX)$ is a close cousin of the condition number associated with the data matrix $X$, and smaller values of $\kappa(X^TX)$ imply a faster rate of convergence.
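The monotonicity claim can be verified in one line. The following worked step is added for illustration and uses only the definition of $\gamma$ given above, writing $\kappa$ for $\kappa(X^TX)$:

```latex
\frac{\partial \gamma}{\partial \varepsilon}
  = -\frac{2 - 2\varepsilon}{4\kappa}
  = -\frac{1 - \varepsilon}{2\kappa} \;\le\; 0
  \quad \text{for } \varepsilon \in (0, 1],
\qquad
\gamma\big|_{\varepsilon = 1} = 1 - \frac{1}{4\kappa} .
```

The derivative vanishes only at $\varepsilon = 1$, so $\gamma$ is strictly decreasing on $(0, 1)$ and attains its smallest value, $1 - \frac{1}{4\kappa}$, at $\varepsilon = 1$.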


In the overdetermined case with $n \ge p$ and $\mathrm{rank}(X) = p$, the condition number $\bar\kappa(X^TX) := \frac{\lambda_{\max}(X^TX)}{\lambda_{\min}(X^TX)}$ plays a key role in determining the stability of the least-squares solution $\hat\beta_{\mathrm{LS}}$ and in measuring the degree of multicollinearity present. Note that $\bar\kappa(X^TX) \in [1, \infty)$, and that the problem is better conditioned for smaller values of this ratio. Furthermore, since $\mathrm{rank}(X) = p$ it holds that $\lambda_{\mathrm{pmin}}(X^TX) = \lambda_{\min}(X^TX)$, and thus $\bar\kappa(X^TX) \le \kappa(X^TX)$ by (9). Thus the condition number $\kappa(X^TX)$ always upper bounds the classical condition number $\bar\kappa(X^TX)$, and if $\lambda_{\max}(X^TX)$ is close to $p$, then $\bar\kappa(X^TX) \approx \kappa(X^TX)$ and the two measures essentially coincide. Finally, since in this setup $\hat\beta_{\mathrm{LS}}$ is unique, part (ii) of Theorem 2.1 implies that the sequence $\{\hat\beta^k\}$ converges linearly to the unique least squares solution $\hat\beta_{\mathrm{LS}}$.

In the underdetermined case with $p > n$, $\lambda_{\min}(X^TX) = 0$ and thus $\bar\kappa(X^TX) = \infty$. On the other hand, $\kappa(X^TX) < \infty$ since $\lambda_{\mathrm{pmin}}(X^TX)$ is the smallest nonzero (hence positive) eigenvalue of $X^TX$. Therefore the condition number $\kappa(X^TX)$ is similar to the classical condition number $\bar\kappa(\cdot)$ restricted to the subspace $S$ spanned by the columns of $X$ (whose dimension is $\mathrm{rank}(X)$). Interestingly, the linear rate of convergence enjoyed by LS-Boost($\varepsilon$) is in a sense adaptive: the algorithm automatically adjusts itself to the convergence rate dictated by the parameter $\gamma$ "as if" it knows that the null space of $X$ is not relevant.


[Figure 5 plot not reproduced. Title: Dynamics of the LS-Boost($\varepsilon$) algorithm versus number of boosting iterations. Panels: $\rho = 0$, $\rho = 0.5$, $\rho = 0.9$. Horizontal axes: Number of Boosting Iterations.]

Figure 5: Showing the LS-Boost($\varepsilon$) algorithm run on the same synthetic dataset as was used in Figure 9 with $p = 500$ and $\varepsilon = 1$, for three different values of the pairwise correlation $\rho$. A point is "on" if the corresponding regression coefficient is updated at iteration $k$. Here the vertical axes have been reoriented so that the coefficients that are updated the maximum number of times appear lower on the axes. For larger values of $\rho$, we see that the LS-Boost($\varepsilon$) algorithm aggressively updates the coefficients for a large number of iterations, whereas the dynamics of the algorithm for smaller values of $\rho$ are less pronounced. For larger values of $\rho$ the LS-Boost($\varepsilon$) algorithm takes longer to reach the least squares fit and this is reflected in the above figure from the update patterns in the regression coefficients. The dynamics of the algorithm evident in this figure nicely complement the insights gained from Figure 1.


As the dataset is varied, the value of $\gamma$ can change substantially from one dataset to another, thereby leading to differences in the convergence behavior bounds in parts (i)-(v) of Theorem 2.1. To settle all of these ideas, we can derive some simple bounds on $\gamma$ using tools from random matrix theory. Towards this end, let us suppose that the entries of $X$ are drawn from a standard Gaussian ensemble, which are subsequently standardized such that every column of $X$ has unit $\ell_2$ norm. Then it follows from random matrix theory [43] that $\lambda_{\mathrm{pmin}}(X^TX) \gtrsim \frac{1}{n}(\sqrt{p} - \sqrt{n})^2$ with high probability. (See Appendix A.2.4 for a more detailed discussion of this fact.) To gain better insights into the behavior of $\gamma$ and how it depends on the values of pairwise correlations of the features, we performed some computational experiments, the results of which are shown in Figure 4. Figure 4 shows the behavior of $\gamma$ as a function of $p$ for a fixed $n = 50$ and $\varepsilon = 1$, for different datasets $X$ simulated as follows. We first generated a multivariate data matrix from a Gaussian distribution with mean zero and covariance $\Sigma_{p \times p} = (\sigma_{ij})$, where $\sigma_{ij} = \rho$ for all $i \ne j$; and then all of the columns of the data matrix were standardized to have unit $\ell_2$ norm. The resulting matrix was taken as $X$. We considered different cases by varying the magnitude of pairwise correlations of the features $\rho$: when $\rho$ is small, the rate of convergence is typically faster (smaller $\gamma$) and the rate becomes slower (higher $\gamma$) for higher values of $\rho$. Figure 4 shows that the coefficient of linear convergence $\gamma$ is quite close to 1.0, which suggests a slowly converging algorithm and confirms our intuition about the algorithmic behavior of LS-Boost($\varepsilon$). Indeed, LS-Boost($\varepsilon$), like any other boosting algorithm, should indeed converge slowly to the unregularized least squares solution. The slowly converging nature of the LS-Boost($\varepsilon$) algorithm provides, for the first time, a precise theoretical justification of the empirical observation made in [28] that stagewise regression is widely considered ineffective as a tool to obtain the unregularized least squares fit, as compared to other stepwise model fitting procedures like Forward Stepwise regression (discussed in Section 1).
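To get a feel for the magnitudes involved, one can plug the random-matrix estimate of $\lambda_{\mathrm{pmin}}(X^TX)$ quoted above into the definition of $\gamma$. The sketch below treats that estimate as if it were an equality, which is an assumption made purely for illustration (the text only states an approximate lower bound holding with high probability, for uncorrelated Gaussian features with $p > n$); the function names are hypothetical.

```python
import math


def approx_lambda_pmin_gaussian(n, p):
    """Heuristic value (sqrt(p) - sqrt(n))**2 / n, intended for p > n."""
    if not 0 < n < p:
        raise ValueError("this estimate is intended for 0 < n < p")
    return (math.sqrt(p) - math.sqrt(n)) ** 2 / n


def approx_gamma(n, p, eps=1.0):
    """gamma = 1 - eps*(2 - eps)*lambda_pmin / (4*p), using the heuristic lambda_pmin."""
    lam = approx_lambda_pmin_gaussian(n, p)
    return 1.0 - eps * (2.0 - eps) * lam / (4.0 * p)


# Setting loosely mirroring the experiments: n = 50 samples, p = 500 features, eps = 1.
gamma_500 = approx_gamma(n=50, p=500)  # roughly 0.9977
```

Even in this uncorrelated setting the resulting value is within a fraction of a percent of one, which is consistent with the remark that $\gamma$ is quite close to 1.0.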


The above discussion sheds some interesting insight into the behavior of the LS-Boost($\varepsilon$) algorithm. For larger values of $\rho$, the observed covariates tend to be even more highly correlated (since $p > n$). Whenever a pair of features are highly correlated, the LS-Boost($\varepsilon$) algorithm finds it difficult to prefer one over the other and thus takes turns in updating both coefficients, thereby distributing the effects of a covariate to all of its correlated cousins. Since a group of correlated covariates are all competing to be updated by the LS-Boost($\varepsilon$) algorithm, the progress made by the algorithm in decreasing the loss function is naturally slowed down. In contrast, when $\rho$ is small, the LS-Boost($\varepsilon$) algorithm brings in a covariate and in a sense completes the process by doing the exact line-search on that feature. This heuristic explanation attempts to explain the slower rate of convergence of the LS-Boost($\varepsilon$) algorithm for large values of $\rho$, a phenomenon that we observe in practice and which is also substantiated by the computational guarantees in Theorem 2.1. We refer the reader to Figures 1 and 5 which further illustrate the above justification. Statement (v) of Theorem 2.1 provides upper bounds on the $\ell_1$ shrinkage of the coefficients. Figure 3 illustrates the evolution of the data-fidelity versus $\ell_1$-shrinkage as obtained from the computational bounds in Theorem 2.1. Some additional discussion and properties of LS-Boost($\varepsilon$) are presented in Appendix A.2.3.


3 Boosting Algorithms as Subgradient Descent 


Roadmap. In this section we present a new unifying framework for interpreting the three boosting algorithms that were discussed in Section 1, namely FS$_\varepsilon$, its non-uniform learning rate extension FS$_{\varepsilon_k}$, and LS-Boost($\varepsilon$). We show herein that all three algorithmic families can be interpreted as instances of the subgradient descent method of convex optimization, applied to the problem of minimizing the largest correlation between residuals and predictors. Interestingly, this unifying lens will also result in a natural generalization of FS$_\varepsilon$ with very strong ties to the Lasso solutions, as we will present in Sections 4 and 5. The framework presented in this section leads to convergence guarantees for FS$_\varepsilon$ and FS$_{\varepsilon_k}$. In Theorem 3.1 herein, we present a theoretical description of the evolution of the FS$_\varepsilon$ algorithm, in terms of its data-fidelity and shrinkage guarantees as a function of the number of boosting iterations. These results are a consequence of the computational guarantees for FS$_\varepsilon$ that inform us about the rate at which the FS$_\varepsilon$ training error, regression coefficients, and predictions make their way to their least squares counterparts. In order to develop these results, we first motivate and briefly review the subgradient descent method of convex optimization.

3.1 Brief Review of Subgradient Descent


We briefly motivate and review the subgradient descent method for non-differentiable convex optimization problems. Consider the following optimization problem:

$$f^* := \min_x\ f(x) \quad \text{s.t.}\ x \in P , \qquad (17)$$

where $P \subseteq \mathbb{R}^n$ is a closed convex set and $f(\cdot) : P \to \mathbb{R}$ is a convex function. If $f(\cdot)$ is differentiable, then $f(\cdot)$ will satisfy the following gradient inequality:

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for any } x, y \in P ,$$

which states that $f(\cdot)$ lies above its first-order (linear) approximation at $x$. One of the most intuitive optimization schemes for solving (17) is the method of gradient descent. This method is initiated at a given point $x^0 \in P$. If $x^k$ is the current iterate, then the next iterate is given by the update formula: $x^{k+1} \leftarrow \Pi_P(x^k - \alpha_k \nabla f(x^k))$. In this method the potential new point is $x^k - \alpha_k \nabla f(x^k)$, where $\alpha_k > 0$ is called the step-size at iteration $k$, and the step is taken in the direction of the negative of the gradient. If this potential new point lies outside of the feasible region $P$, it is then projected back onto $P$. Here recall that $\Pi_P(\cdot)$ is the Euclidean projection operator, namely $\Pi_P(x) := \arg\min_{y \in P} \|x - y\|_2$.


Now suppose that $f(\cdot)$ is not differentiable. By virtue of the fact that $f(\cdot)$ is convex, $f(\cdot)$ will have a subgradient at each point $x$. Recall that $g$ is a subgradient of $f(\cdot)$ at $x$ if the following subgradient inequality holds:

$$f(y) \ge f(x) + g^T (y - x) \quad \text{for all } y \in P , \qquad (18)$$

which generalizes the gradient inequality above and states that $f(\cdot)$ lies above the linear function on the right side of (18). Because there may exist more than one subgradient of $f(\cdot)$ at $x$, let $\partial f(x)$ denote the set of subgradients of $f(\cdot)$ at $x$. Then "$g \in \partial f(x)$" denotes that $g$ is a subgradient of $f(\cdot)$ at the point $x$, and so $g$ satisfies (18) for all $y$. The subgradient descent method (see [40], for example) is a simple generalization of the method of gradient descent to the case when $f(\cdot)$ is not differentiable. One simply replaces the gradient by the subgradient, yielding the following update scheme:

Compute a subgradient of $f(\cdot)$ at $x^k$: $g^k \in \partial f(x^k)$

Perform update at $x^k$: $x^{k+1} \leftarrow \Pi_P(x^k - \alpha_k g^k)$ . (19)


The following proposition summarizes a well-known computational guarantee associated with the subgradient descent method.


Proposition 3.1. (Convergence Bound for Subgradient Descent [34, 35]) Consider the subgradient descent method (19), using a constant step-size $\alpha_i = \alpha$ for all $i$. Let $x^*$ be an optimal solution of (17) and suppose that the subgradients are uniformly bounded, namely $\|g^i\|_2 \le G$ for all $i \ge 0$. Then for each $k \ge 0$, the following inequality holds:

$$\min_{i \in \{0, \ldots, k\}} f(x^i) \ \le\ f^* + \frac{\|x^0 - x^*\|_2^2}{2(k+1)\alpha} + \frac{\alpha G^2}{2} . \qquad (20)$$

□

The left side of (20) is simply the best objective function value obtained among the first $k$ iterations. The right side of (20) bounds the best objective function value from above, namely the optimal value $f^*$ plus a nonnegative quantity that is a function of the number of iterations $k$, the constant step-size $\alpha$, the bound $G$ on the norms of subgradients, and the distance from the initial point to an optimal solution $x^*$ of (17). Note that for a fixed step-size $\alpha > 0$, the right side of (20) goes to $f^* + \frac{\alpha G^2}{2}$ as $k \to \infty$. In the interest of completeness, we include a proof of Proposition 3.1 in the Appendix.
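The update scheme (19) and the bound (20) can be illustrated on a small problem. The sketch below is an illustrative example and not taken from the paper: it minimizes the one-dimensional convex function $f(x) = |x - 3|$ over $P = [0, 10]$, for which $f^* = 0$, $x^* = 3$, and $G = 1$; the function names and the step-size value are assumptions chosen for the demonstration.

```python
def projected_subgradient_descent(f, subgrad, project, x0, alpha, num_iters):
    """Run x^{i+1} = project(x^i - alpha * g^i) and record min_{i <= k} f(x^i) for each k."""
    x = x0
    best_values = [f(x)]
    for _ in range(num_iters):
        g = subgrad(x)                 # any subgradient of f at the current iterate
        x = project(x - alpha * g)     # step, then Euclidean projection onto P
        best_values.append(min(best_values[-1], f(x)))
    return best_values


def f(x):
    return abs(x - 3.0)


def subgrad(x):
    # A valid subgradient of |x - 3|: the sign away from the kink, 0 at the kink.
    return 1.0 if x > 3.0 else (-1.0 if x < 3.0 else 0.0)


def project(x):
    return min(max(x, 0.0), 10.0)      # projection onto the interval [0, 10]


alpha, G, x0, x_star, f_star = 0.4, 1.0, 0.0, 3.0, 0.0
best = projected_subgradient_descent(f, subgrad, project, x0, alpha, 50)

# Right-hand side of (20) for k = 0, 1, ..., 50.
bounds = [
    f_star + (x0 - x_star) ** 2 / (2 * (k + 1) * alpha) + alpha * G ** 2 / 2
    for k in range(51)
]
```

With a constant step-size the iterates in this example end up oscillating around the minimizer rather than converging to it, which is why the bound (20) retains the residual term $\alpha G^2 / 2$.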


3.2 A Subgradient Descent Framework for Boosting 

We now show that the boosting algorithms discussed in Section 1, namely FS$_\varepsilon$ and its relatives FS$_{\varepsilon_k}$ and LS-Boost($\varepsilon$), can all be interpreted as instantiations of the subgradient descent method to minimize the largest absolute correlation between the residuals and predictors.

Let $P_{\mathrm{res}} := \{r \in \mathbb{R}^n : r = y - X\beta \text{ for some } \beta \in \mathbb{R}^p\}$ denote the affine space of residuals and consider the following convex optimization problem:

$$\text{Correlation Minimization (CM)}: \quad f^* := \min_r\ f(r) := \|X^T r\|_\infty \quad \text{s.t.}\ r \in P_{\mathrm{res}} , \qquad (21)$$

which we dub the "Correlation Minimization" problem, or CM for short. Note an important subtlety in the CM problem, namely that the optimization variable in CM is the residual $r$ and not the regression coefficient vector $\beta$.

Since the columns of $X$ have unit $\ell_2$ norm by assumption, $f(r)$ is the largest absolute correlation between the residual vector $r$ and the predictors. Therefore (21) is the convex optimization problem of minimizing the largest correlation between the residuals and the predictors, over all possible values of the residuals. From the gradient of the least squares loss, with $r = y - X\beta$ we observe that $X^T r = 0$ if and only if $\beta$ is a least squares solution, whereby $f(r) = \|X^T r\|_\infty = 0$ for the least squares residual vector $r = \hat r_{\mathrm{LS}} = y - X\hat\beta_{\mathrm{LS}}$. Since the objective function in (21) is nonnegative, we conclude that $f^* = 0$ and the least squares residual vector $\hat r_{\mathrm{LS}}$ is also the unique optimal solution of the CM problem (21). Thus CM can be viewed as an optimization problem which also produces the least squares solution.

The following proposition states that the three boosting algorithms FS$_\varepsilon$, FS$_{\varepsilon_k}$ and LS-Boost($\varepsilon$) can all be viewed as instantiations of the subgradient descent method to solve the CM problem (21).


Proposition 3.2. Consider the subgradient descent method (19) with step-size sequence $\{\alpha_k\}$ to solve the correlation minimization (CM) problem (21), initialized at $\hat r^0 = y$. Then:


(i) the FS$_\varepsilon$ algorithm is an instance of subgradient descent, with a constant step-size $\alpha_k := \varepsilon$ at each iteration,

(ii) the FS$_{\varepsilon_k}$ algorithm is an instance of subgradient descent, with non-uniform step-sizes $\alpha_k := \varepsilon_k$ at iteration $k$, and

(iii) the LS-Boost($\varepsilon$) algorithm is an instance of subgradient descent, with non-uniform step-sizes $\alpha_k := \varepsilon |\tilde u_{j_k}|$ at iteration $k$, where $\tilde u_{j_k} := \arg\min_u \|\hat r^k - X_{j_k} u\|_2^2$.


Proof. We first prove (i). Recall the update of the residuals in FS$_\varepsilon$:

$$\hat r^{k+1} = \hat r^k - \varepsilon \cdot \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k} .$$

We first show that $g^k := \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k}$ is a subgradient of the objective function $f(r) = \|X^T r\|_\infty$ of the correlation minimization problem CM (21) at $r = \hat r^k$. At iteration $k$, FS$_\varepsilon$ chooses the coefficient to update by selecting $j_k \in \arg\max_{j \in \{1,\ldots,p\}} |(\hat r^k)^T X_j|$, whereby $\mathrm{sgn}((\hat r^k)^T X_{j_k}) \left( (\hat r^k)^T X_{j_k} \right) = \|X^T \hat r^k\|_\infty$, and therefore for any $r$ it holds that:

$$\begin{aligned}
f(r) = \|X^T r\|_\infty &\ge \mathrm{sgn}((\hat r^k)^T X_{j_k}) \left( (X_{j_k})^T r \right) \\
&= \mathrm{sgn}((\hat r^k)^T X_{j_k}) \left( (X_{j_k})^T (\hat r^k + r - \hat r^k) \right) \\
&= \|X^T \hat r^k\|_\infty + \mathrm{sgn}((\hat r^k)^T X_{j_k}) \left( (X_{j_k})^T (r - \hat r^k) \right) \\
&= f(\hat r^k) + \mathrm{sgn}((\hat r^k)^T X_{j_k}) \left( (X_{j_k})^T (r - \hat r^k) \right) .
\end{aligned}$$

Therefore using the definition of a subgradient in (18), it follows that $g^k := \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k}$ is a subgradient of $f(r) = \|X^T r\|_\infty$ at $r = \hat r^k$. Therefore the update $\hat r^{k+1} = \hat r^k - \varepsilon \cdot \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k}$ is of the form $\hat r^{k+1} = \hat r^k - \varepsilon g^k$ where $g^k \in \partial f(\hat r^k)$. Last of all notice that the update can also be written as $\hat r^k - \varepsilon g^k = \hat r^{k+1} = y - X\hat\beta^{k+1} \in P_{\mathrm{res}}$, hence $\Pi_{P_{\mathrm{res}}}(\hat r^k - \varepsilon g^k) = \hat r^k - \varepsilon g^k$, i.e., the projection step is superfluous here, and therefore $\hat r^{k+1} = \Pi_{P_{\mathrm{res}}}(\hat r^k - \varepsilon g^k)$, which is precisely the update for the subgradient descent method with step-size $\alpha_k := \varepsilon$.

The proof of (ii) is the same as (i) with a step-size choice of $\alpha_k = \varepsilon_k$ at iteration $k$. Furthermore, as discussed in Section 1, LS-Boost($\varepsilon$) may be thought of as a specific instance of FS$_{\varepsilon_k}$, whereby the proof of (iii) follows as a special case of (ii). □
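The two facts used in the proof, namely that $g^k$ satisfies the subgradient inequality (18) and that the step $\hat r^k - \varepsilon g^k$ stays in $P_{\mathrm{res}}$, can be checked numerically. The following is a small sketch under stated assumptions: the helper names, the toy data with unit-norm columns, and the random test points are all hypothetical and serve only as an illustration of the argument.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def f_cm(X_cols, r):
    """CM objective f(r) = ||X^T r||_inf, with predictors stored as columns."""
    return max(abs(dot(col, r)) for col in X_cols)


def cm_subgradient(X_cols, r):
    """Return (j, s, g) with g = s * X_j, j maximizing |r^T X_j| and s = sgn(r^T X_j).

    If the maximal correlation is exactly zero we take s = +1; g is still a
    valid subgradient in that degenerate case.
    """
    corr = [dot(col, r) for col in X_cols]
    j = max(range(len(X_cols)), key=lambda idx: abs(corr[idx]))
    s = 1.0 if corr[j] >= 0 else -1.0
    return j, s, [s * v for v in X_cols[j]]


def fs_eps_step(X_cols, r, beta, eps):
    """One FS_eps iteration, written as the subgradient step r - eps * g."""
    j, s, g = cm_subgradient(X_cols, r)
    new_beta = list(beta)
    new_beta[j] += eps * s
    new_r = [ri - eps * gi for ri, gi in zip(r, g)]
    return new_r, new_beta


# Hypothetical toy data with unit-norm columns.
X_cols = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
y = [1.0, 2.0, 3.0]

r, beta = list(y), [0.0, 0.0, 0.0]
rng = random.Random(0)
subgradient_ok = True
stays_in_P_res = True
for _ in range(20):
    _, _, g = cm_subgradient(X_cols, r)
    for _ in range(50):
        other = [rng.uniform(-5.0, 5.0) for _ in y]
        lhs = f_cm(X_cols, other)
        rhs = f_cm(X_cols, r) + dot(g, [o - ri for o, ri in zip(other, r)])
        subgradient_ok = subgradient_ok and lhs >= rhs - 1e-9   # inequality (18)
    r, beta = fs_eps_step(X_cols, r, beta, eps=0.1)
    fitted = [sum(beta[j] * X_cols[j][i] for j in range(3)) for i in range(3)]
    stays_in_P_res = stays_in_P_res and all(
        abs(yi - fi - ri) < 1e-9 for yi, fi, ri in zip(y, fitted, r)
    )
```

Because the new residual always equals $y - X\beta$ for the updated coefficients, no projection is needed in `fs_eps_step`, mirroring the remark in the proof that the projection step is superfluous.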

Proposition 3.2 presents a new interpretation of the boosting algorithms FS$_\varepsilon$ and its cousins as subgradient descent. This is interesting especially since FS$_\varepsilon$ and LS-Boost($\varepsilon$) have been traditionally interpreted as greedy coordinate descent or steepest descent type procedures [25, 28]. This has the following consequences of note:

• We take recourse to existing tools and results about subgradient descent optimization to inform us about the computational guarantees of these methods. When translated to the setting of linear regression, these results will shed light on the data fidelity vis-a-vis shrinkage characteristics of FS$_\varepsilon$ and its cousins, all using quantities that can be easily obtained prior to running the boosting algorithm. We will show the details of this in Theorem 3.1 below.

• The subgradient optimization viewpoint provides a unifying algorithmic theme which we will also apply to a regularized version of problem CM and that we will show is very strongly connected to the Lasso. This will be developed in Section 4. Indeed, the regularized version of the CM problem that we will develop in Section 4 will lead to a new family of boosting algorithms which are a seemingly minor variant of the basic FS$_\varepsilon$ algorithm but deliver ($O(\varepsilon)$-approximate) solutions to the Lasso.

3.3 Deriving and Interpreting Computational Guarantees for FS$_\varepsilon$


The following theorem presents the convergence properties of FS$_\varepsilon$, which are a consequence of the interpretation of FS$_\varepsilon$ as an instance of the subgradient descent method.


Theorem 3.1. (Convergence Properties of FS$_\varepsilon$) Consider the FS$_\varepsilon$ algorithm with learning rate $\varepsilon$. Let $k \ge 0$ be the total number of iterations. Then there exists an index $i \in \{0, \ldots, k\}$ for which the following bounds hold:


(i) (training error): $L_n(\hat\beta^i) - L_n^* \le \frac{p}{2n\lambda_{\mathrm{pmin}}(X^TX)} \left[ \frac{\|X\hat\beta_{\mathrm{LS}}\|_2^2}{\varepsilon(k+1)} + \varepsilon \right]^2$

(ii) (regression coefficients): there exists a least squares solution $\hat\beta^i_{\mathrm{LS}}$ such that:

$$\|\hat\beta^i - \hat\beta^i_{\mathrm{LS}}\|_2 \le \frac{\sqrt{p}}{\lambda_{\mathrm{pmin}}(X^TX)} \left[ \frac{\|X\hat\beta_{\mathrm{LS}}\|_2^2}{\varepsilon(k+1)} + \varepsilon \right]$$

(iii) (predictions): for every least-squares solution $\hat\beta_{\mathrm{LS}}$ it holds that

$$\|X\hat\beta^i - X\hat\beta_{\mathrm{LS}}\|_2 \le \frac{\sqrt{p}}{\sqrt{\lambda_{\mathrm{pmin}}(X^TX)}} \left[ \frac{\|X\hat\beta_{\mathrm{LS}}\|_2^2}{\varepsilon(k+1)} + \varepsilon \right]$$

(iv) (correlation values): $\|X^T \hat r^i\|_\infty \le \frac{\|X\hat\beta_{\mathrm{LS}}\|_2^2}{2\varepsilon(k+1)} + \frac{\varepsilon}{2}$

(v) ($\ell_1$-shrinkage of coefficients): $\|\hat\beta^i\|_1 \le k\varepsilon$

(vi) (sparsity of coefficients): $\|\hat\beta^i\|_0 \le k$.

□

The proof of Theorem 3.1 is presented in Appendix A.3.2.


Interpreting the Computational Guarantees. Theorem 3.1 accomplishes for FS$_\varepsilon$ what Theorem 2.1 did for LS-Boost($\varepsilon$): parts (i)-(iv) of the theorem describe the rate at which the training error, regression coefficients, and related quantities make their way towards their ($O(\varepsilon)$-approximate) unregularized least squares counterparts. Part (v) of the theorem also describes the rate at which the shrinkage of the regression coefficients evolves as a function of the number of boosting iterations. The rate of convergence of FS$_\varepsilon$ is sublinear, unlike the linear rate of convergence for LS-Boost($\varepsilon$). Note that this type of sublinear convergence implies that the rate of decrease of the training error (for instance) is dramatically faster in the very early iterations as compared to later iterations. Taken together, Theorems 3.1 and 2.1 highlight an important difference between the behavior of algorithms LS-Boost($\varepsilon$) and FS$_\varepsilon$:


• the limiting solution of the LS-Boost($\varepsilon$) algorithm (as $k \to \infty$) corresponds to the unregularized least squares solution, but

• the limiting solution of the FS$_\varepsilon$ algorithm (as $k \to \infty$) corresponds to an $O(\varepsilon)$ approximate least squares solution.

[Figure 6 plot not reproduced. Title: FS$_\varepsilon$ algorithm: $\ell_1$ shrinkage versus data-fidelity tradeoffs (theoretical bounds). Panels: Synthetic dataset ($\kappa = 1$), Leukemia dataset, Leukemia dataset (zoomed). Horizontal axes: $\ell_1$ shrinkage of coefficients.]

Figure 6: Figure showing profiles of $\ell_1$ shrinkage bounds of the regression coefficients versus training error bounds for the FS$_\varepsilon$ algorithm, for different values of the learning rate $\varepsilon$. The profiles have been obtained from the bounds in parts (i) and (v) of Theorem 3.1. The left panel corresponds to a hypothetical dataset using $\kappa = \frac{p}{\lambda_{\mathrm{pmin}}} = 1$, and the middle and right panels use the parameters of the Leukemia dataset.

As demonstrated in Theorems 2.1 and 3.1, both LS-Boost($\varepsilon$) and FS$_\varepsilon$ have nice convergence properties with respect to the unconstrained least squares problem. However, unlike the convergence results for LS-Boost($\varepsilon$) in Theorem 2.1, FS$_\varepsilon$ exhibits a sublinear rate of convergence towards a suboptimal least squares solution. For example, part (i) of Theorem 3.1 implies in the limit as $k \to \infty$ that FS$_\varepsilon$ identifies a model with training error at most:

$$L_n^* + \frac{p\,\varepsilon^2}{2n\left(\lambda_{\mathrm{pmin}}(X^TX)\right)} . \qquad (22)$$

In addition, part (ii) of Theorem 3.1 implies that as $k \to \infty$, FS$_\varepsilon$ identifies a model whose distance to the set of least squares solutions $\{\hat\beta_{\mathrm{LS}} : X^TX\hat\beta_{\mathrm{LS}} = X^Ty\}$ is at most:

$$\frac{\varepsilon\sqrt{p}}{\lambda_{\mathrm{pmin}}(X^TX)} .$$

Note that the computational guarantees in Theorem 3.1 involve the quantities $\lambda_{\mathrm{pmin}}(X^TX)$ and $\|X\hat\beta_{\mathrm{LS}}\|_2$, assuming $n$ and $p$ are fixed. To settle ideas, let us consider the synthetic datasets used in Figures 4 and 1, where the covariates were generated from a multivariate Gaussian distribution with pairwise correlation $\rho$. Figure 4 suggests that $\lambda_{\mathrm{pmin}}(X^TX)$ decreases with increasing $\rho$ values. Thus, controlling for other factors appearing in the computational bounds (see footnote 4), it follows from the statements of Theorem 3.1 that the training error decreases much more rapidly for smaller $\rho$ values, as a function of $k$. This is nicely validated by the computational results in Figure 1 (the three top panel figures), which show that the training errors decay at a faster rate for smaller values of $\rho$.


4 To control for other factors, for example, we may assume that $p > n$ and for different values of $\rho$ we have
$\|\mathbf{X}\hat\beta_{LS}\|_2 = \|\mathbf{y}\|_2 = 1$ with $\varepsilon$ fixed across the different examples.

Let us examine more carefully the properties of the sequence of models explored by FS$_\varepsilon$ and the
corresponding tradeoffs between data fidelity and model complexity. Let TBound and SBound
denote the training error bound and shrinkage bound in parts (i) and (v) of Theorem 3.1, respectively.
Then simple manipulation of the arithmetic in these two bounds yields the following tradeoff
equation:

$$\mathrm{TBound} = \frac{p}{2n\,\lambda_{\mathrm{pmin}}(\mathbf{X}^T\mathbf{X})}
\left[\frac{\|\mathbf{X}\hat\beta_{LS}\|_2^2}{\mathrm{SBound} + \varepsilon} + \varepsilon\right]^2.$$
The above tradeoff between the training error bound and the shrinkage bound is illustrated in
Figure 6, which shows this tradeoff curve for four different values of the learning rate $\varepsilon$. Except
for very small shrinkage levels, lower values of $\varepsilon$ produce smaller training errors. But unlike the
corresponding tradeoff curves for LS-Boost($\varepsilon$), there is a range of values of the shrinkage for which
smaller values of $\varepsilon$ actually produce larger training errors, though admittedly this range is for very
small shrinkage values. For more reasonable shrinkage values, smaller values of $\varepsilon$ will correspond
to smaller values of the training error.
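
The crossover described above can be checked numerically. The following is a minimal sketch of the
tradeoff equation, not the code used to produce Figure 6; the function name `fs_tradeoff_bound` and the
constants (chosen so that $p/\lambda_{\mathrm{pmin}} = 1$) are illustrative assumptions.

```python
import numpy as np

def fs_tradeoff_bound(sbound, eps, p, n, lam_pmin, xb_ls_norm):
    """Training error bound (TBound) as a function of the shrinkage bound (SBound).

    lam_pmin stands for lambda_pmin(X^T X) and xb_ls_norm for ||X beta_LS||_2.
    """
    sbound = np.asarray(sbound, dtype=float)
    inner = xb_ls_norm ** 2 / (sbound + eps) + eps
    return p / (2.0 * n * lam_pmin) * inner ** 2

# Hypothetical constants, used only for illustration.
p, n, lam_pmin, xb_ls_norm = 10, 50, 10.0, 1.0
sbounds = np.array([1e-4, 1e-2, 1.0, 10.0])
for eps in (0.01, 0.1):
    print(eps, fs_tradeoff_bound(sbounds, eps, p, n, lam_pmin, xb_ls_norm))
```

With these constants the smaller learning rate gives the larger bound at SBound $= 10^{-4}$ and the smaller
bound at SBound $= 10$, and as SBound grows the bound tends to the excess term
$p\varepsilon^2/(2n\lambda_{\mathrm{pmin}}(\mathbf{X}^T\mathbf{X}))$ in (22).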


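Part (v) of Theorems 2.1 and 3.1 presents shrinkage bounds for LS-Boost($\varepsilon$) and FS$_\varepsilon$,
respectively. Let us briefly compare these bounds. Examining the shrinkage bound for LS-Boost($\varepsilon$), we
can bound the left term from above by $\sqrt{k}\sqrt{\varepsilon}\,\|\mathbf{X}\hat\beta_{LS}\|_2$. We can also
bound the right term from above by $\varepsilon\|\mathbf{X}\hat\beta_{LS}\|_2/(1 - \sqrt{\gamma})$, where recall
from Section 2 that $\gamma$ is the linear convergence rate coefficient

$$\gamma := 1 - \frac{\varepsilon(2-\varepsilon)\,\lambda_{\mathrm{pmin}}(\mathbf{X}^T\mathbf{X})}{4p}.$$

We may therefore alternatively write the following shrinkage bound for LS-Boost($\varepsilon$):

$$\|\hat\beta^k\|_1 \le \|\mathbf{X}\hat\beta_{LS}\|_2 \,
\min\left\{\sqrt{k}\sqrt{\varepsilon},\ \varepsilon/(1 - \sqrt{\gamma})\right\}. \qquad (23)$$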

The shrinkage bound for FS$_\varepsilon$ is simply $k\varepsilon$. Comparing these two bounds, we observe that not
only does the shrinkage bound for FS$_\varepsilon$ grow at a faster rate as a function of $k$ for large enough $k$,
but also the shrinkage bound for FS$_\varepsilon$ grows unbounded in $k$, unlike the right term above for the
shrinkage bound of LS-Boost($\varepsilon$).
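
To make the comparison concrete, the two shrinkage bounds can be evaluated side by side. This is a minimal
sketch under assumed constants; the function names and the values of $\varepsilon$, $\lambda_{\mathrm{pmin}}$,
$p$ and $\|\mathbf{X}\hat\beta_{LS}\|_2$ below are hypothetical.

```python
import math

def gamma_rate(eps, lam_pmin, p):
    """Linear convergence rate coefficient gamma of LS-Boost(eps)."""
    return 1.0 - eps * (2.0 - eps) * lam_pmin / (4.0 * p)

def ls_boost_shrinkage_bound(k, eps, lam_pmin, p, xb_ls_norm):
    """Right-hand side of the simplified LS-Boost(eps) shrinkage bound (23)."""
    g = gamma_rate(eps, lam_pmin, p)
    return xb_ls_norm * min(math.sqrt(k) * math.sqrt(eps), eps / (1.0 - math.sqrt(g)))

def fs_shrinkage_bound(k, eps):
    """Shrinkage bound k * eps for FS_eps."""
    return k * eps

# Hypothetical constants, used only for illustration.
eps, lam_pmin, p, xb_ls_norm = 0.1, 1.0, 4, 1.0
for k in (1, 100, 10**4, 10**6):
    print(k, ls_boost_shrinkage_bound(k, eps, lam_pmin, p, xb_ls_norm), fs_shrinkage_bound(k, eps))
```

For small $k$ the FS$_\varepsilon$ bound can be the smaller of the two, but once $k$ is large the
LS-Boost($\varepsilon$) bound stays at its constant right term while $k\varepsilon$ keeps growing.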

One can also compare FS$_\varepsilon$ and LS-Boost($\varepsilon$) in terms of the efficiency with which these two
methods achieve a certain pre-specified data-fidelity. In Appendix A.3.3 we show, at least in theory, that
LS-Boost($\varepsilon$) is much more efficient than FS$_\varepsilon$ at achieving such data-fidelity, and
furthermore it does so with much better shrinkage.


4 Regularized Correlation Minimization, Boosting, and Lasso 


Roadmap In this section we introduce a new boosting algorithm, parameterized by a scalar
$\delta > 0$, which we denote by R-FS$_{\varepsilon,\delta}$ (for Regularized incremental Forward Stagewise
regression), that is obtained by incorporating a simple rescaling step to the coefficient updates in
FS$_\varepsilon$. We then introduce a regularized version of the Correlation Minimization (CM) problem (21) which
we refer to as RCM. We show that the adaptation of the subgradient descent algorithmic framework to the
Regularized Correlation Minimization problem RCM exactly yields the algorithm R-FS$_{\varepsilon,\delta}$. The
new algorithm R-FS$_{\varepsilon,\delta}$ may be interpreted as a natural extension of popular boosting algorithms
like FS$_\varepsilon$, and has the following notable properties:
• Whereas FS$_\varepsilon$ updates the coefficients in an additive fashion by adding a small amount $\varepsilon$
to the coefficient most correlated with the current residuals, R-FS$_{\varepsilon,\delta}$ first shrinks all of
the coefficients by a scaling factor $1 - \frac{\varepsilon}{\delta} < 1$ and then updates the selected
coefficient in the same additive fashion as FS$_\varepsilon$.

• R-FS$_{\varepsilon,\delta}$ delivers $O(\varepsilon)$-accurate solutions to the Lasso in the limit as
$k \to \infty$, unlike FS$_\varepsilon$ which delivers $O(\varepsilon)$-accurate solutions to the unregularized
least squares problem.

• R-FS$_{\varepsilon,\delta}$ has computational guarantees similar in spirit to the ones described in the context
of FS$_\varepsilon$; these quantities directly inform us about the data-fidelity vis-à-vis shrinkage tradeoffs
as a function of the number of boosting iterations and the learning rate $\varepsilon$.


The notion of using additional regularization along with the implicit shrinkage imparted by boosting
is not new in the literature. Various interesting notions have been proposed in [10, 14, 22, 26, 45];
see also the discussion in Appendix A.4.4 herein. However, the framework we present here is new.
We present a unified subgradient descent framework for a class of regularized CM problems that
results in algorithms that have appealing structural similarities with forward stagewise regression
type algorithms, while also being very strongly connected to the Lasso.


Boosting with additional shrinkage: R-FS$_{\varepsilon,\delta}$ Here we give a formal description of the
R-FS$_{\varepsilon,\delta}$ algorithm. R-FS$_{\varepsilon,\delta}$ is controlled by two parameters: the learning
rate $\varepsilon$, which plays the same role as the learning rate in FS$_\varepsilon$, and the “regularization
parameter” $\delta \ge \varepsilon$. Our reason for referring to $\delta$ as a regularization parameter is due to
the connection between R-FS$_{\varepsilon,\delta}$ and the Lasso, which will be made clear later. The shrinkage
factor, i.e., the amount by which we shrink the coefficients before updating the selected coefficient, is
determined as $1 - \frac{\varepsilon}{\delta}$. Supposing that we choose to update the coefficient indexed by
$j_k$ at iteration $k$, then the coefficient update may be written as:

$$\hat\beta^{k+1} \leftarrow \left(1 - \tfrac{\varepsilon}{\delta}\right)\hat\beta^{k}
+ \varepsilon \cdot \mathrm{sgn}\big((\hat r^k)^T \mathbf{X}_{j_k}\big)\, e_{j_k}.$$

Below we give a concise description of R-FS$_{\varepsilon,\delta}$, including the update for the residuals that
corresponds to the update for the coefficients stated above.

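Algorithm: R-FS$_{\varepsilon,\delta}$

Fix the learning rate $\varepsilon > 0$, regularization parameter $\delta > 0$ such that
$\varepsilon \le \delta$, and number of iterations $M$.

Initialize at $\hat r^0 = \mathbf{y}$, $\hat\beta^0 = 0$, $k = 0$.

1. For $0 \le k \le M$ do the following:

2. Compute: $j_k \in \arg\max_{j \in \{1,\ldots,p\}} |(\hat r^k)^T \mathbf{X}_j|$

3. $\hat r^{k+1} \leftarrow \hat r^k - \varepsilon\left[\mathrm{sgn}\big((\hat r^k)^T \mathbf{X}_{j_k}\big)\mathbf{X}_{j_k} + \tfrac{1}{\delta}(\hat r^k - \mathbf{y})\right]$

$\hat\beta^{k+1}_{j_k} \leftarrow \left(1 - \tfrac{\varepsilon}{\delta}\right)\hat\beta^{k}_{j_k} + \varepsilon\,\mathrm{sgn}\big((\hat r^k)^T \mathbf{X}_{j_k}\big)$ and $\hat\beta^{k+1}_{j} \leftarrow \left(1 - \tfrac{\varepsilon}{\delta}\right)\hat\beta^{k}_{j}$, $j \ne j_k$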

Note that R-FS$_{\varepsilon,\delta}$ and FS$_\varepsilon$ are structurally very similar, and indeed when
$\delta = \infty$ then R-FS$_{\varepsilon,\delta}$ is exactly FS$_\varepsilon$. Note also that
R-FS$_{\varepsilon,\delta}$ shares the same upper bound on the sparsity of the regression coefficients as
FS$_\varepsilon$, namely for all $k$ it holds that: $\|\hat\beta^k\|_0 \le k$. When $\delta < \infty$ then, as
previously mentioned, the main structural difference between R-FS$_{\varepsilon,\delta}$ and FS$_\varepsilon$ is
the additional rescaling of the coefficients by the factor $1 - \frac{\varepsilon}{\delta}$. This rescaling better
controls the growth of the coefficients and, as will be demonstrated next, plays a key role in connecting
R-FS$_{\varepsilon,\delta}$ to the Lasso.
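
The algorithm box above translates almost line by line into NumPy. The following is a minimal sketch rather
than the authors' implementation: the function name `r_fs`, its arguments and its return values are
illustrative assumptions, and passing `delta=np.inf` gives the FS$_\varepsilon$ special case.

```python
import numpy as np

def r_fs(X, y, eps, delta, num_iters):
    """Sketch of R-FS_{eps,delta}: shrink all coefficients, then take one FS-type step.

    Returns the coefficient iterates (one row per iteration, starting at beta^0 = 0)
    and the final residual vector.
    """
    if not (0 < eps <= delta):
        raise ValueError("need 0 < eps <= delta")
    n, p = X.shape
    beta = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()   # r^0 = y
    shrink = 1.0 - eps / delta              # equals 1.0 when delta = inf
    path = [beta.copy()]
    for _ in range(num_iters):
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))    # feature most correlated with residuals
        s = np.sign(corr[j])
        # Residual update of Step 3.
        r = r - eps * (s * X[:, j] + (r - y) / delta)
        # Coefficient update of Step 3: rescale everything, then move coordinate j.
        beta = shrink * beta
        beta[j] += eps * s
        path.append(beta.copy())
    return np.array(path), r

# Usage on a small synthetic problem with unit-norm columns (assumed setup).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 8))
X_demo /= np.linalg.norm(X_demo, axis=0)
y_demo = rng.standard_normal(30)
path_demo, r_demo = r_fs(X_demo, y_demo, eps=0.01, delta=0.5, num_iters=500)
print(np.abs(path_demo[-1]).sum())          # l1 norm of the final iterate, at most delta
```

Keeping the residual update separate from the coefficient update mirrors the algorithm statement; the two
stay consistent because $\hat r^k = \mathbf{y} - \mathbf{X}\hat\beta^k$ is preserved at every iteration.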


Regularized Correlation Minimization (RCM) and Lasso The starting point of our formal
analysis of R-FS$_{\varepsilon,\delta}$ is the Correlation Minimization (CM) problem (21), which we now modify by
introducing a regularization term that penalizes residuals that are far from the vector of observations
$\mathbf{y}$. This modification leads to the following parametric family of optimization problems indexed by
$\delta \in (0, \infty]$:

$$\mathrm{RCM}_\delta: \quad f_\delta^* := \min_{r} \ f_\delta(r) := \|\mathbf{X}^T r\|_\infty + \frac{1}{2\delta}\|r - \mathbf{y}\|_2^2
\quad \text{s.t.} \quad r \in P_{\mathrm{res}} := \{r \in \mathbb{R}^n : r = \mathbf{y} - \mathbf{X}\beta \text{ for some } \beta \in \mathbb{R}^p\}, \qquad (24)$$

where “RCM” connotes Regularized Correlation Minimization. Note that RCM reduces to the
correlation minimization problem CM (21) when $\delta = \infty$. RCM may be interpreted as the problem
of minimizing, over the space of residuals, the largest correlation between the residuals and the
predictors plus a regularization term that penalizes residuals that are far from the response $\mathbf{y}$
(which itself can be interpreted as the residuals associated with the model $\beta = 0$).


Interestingly, as we show in Appendix A.4.1, RCM (24) is equivalent to the Lasso (2) via duality.
This equivalence provides further insight about the regularization used to obtain RCM$_\delta$.
Comparing the Lasso and RCM, notice that the space of the variables of the Lasso is the space
of regression coefficients $\beta$, namely $\mathbb{R}^p$, whereas the space of the variables of RCM is the space
of model residuals, namely $\mathbb{R}^n$, or more precisely $P_{\mathrm{res}}$. The duality relationship shows
that RCM$_\delta$ (24) is an equivalent characterization of the Lasso problem, just like the correlation
minimization (CM) problem (21) is an equivalent characterization of the (unregularized) least squares problem.
Recall that Proposition 3.2 showed that subgradient descent applied to the CM problem (24) (which is
RCM$_\delta$ with $\delta = \infty$) leads to the well-known boosting algorithm FS$_\varepsilon$. We now extend
this theme with the following Proposition, which demonstrates that R-FS$_{\varepsilon,\delta}$ is equivalent to
subgradient descent applied to RCM$_\delta$.

Proposition 4.1. The R-FS$_{\varepsilon,\delta}$ algorithm is an instance of subgradient descent to solve the
regularized correlation minimization (RCM$_\delta$) problem (24), initialized at $r^0 = \mathbf{y}$, with a
constant step-size $\alpha_k := \varepsilon$ at each iteration.

The proof of Proposition 4.1 is presented in Appendix A.4.2.
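
The calculation behind Proposition 4.1 can be sketched in two lines; the complete argument is the one in
Appendix A.4.2, and the symbol $g^k$ for the subgradient is introduced here only for illustration. The sketch
uses the standard fact that a subgradient of a pointwise maximum is obtained from any term attaining the
maximum, together with $r^k - \mathbf{y} = -\mathbf{X}\hat\beta^k$ on $P_{\mathrm{res}}$:

```latex
\begin{align*}
g^k &:= \operatorname{sgn}\big((r^k)^T \mathbf{X}_{j_k}\big)\,\mathbf{X}_{j_k}
        + \tfrac{1}{\delta}\,(r^k - \mathbf{y}) \in \partial f_\delta(r^k),
        \qquad j_k \in \arg\max_{j} \big|(r^k)^T \mathbf{X}_j\big|, \\
r^{k+1} &= r^k - \varepsilon\, g^k
        = \mathbf{y} - \mathbf{X}\Big[\big(1 - \tfrac{\varepsilon}{\delta}\big)\hat\beta^k
          + \varepsilon\,\operatorname{sgn}\big((r^k)^T \mathbf{X}_{j_k}\big)\, e_{j_k}\Big].
\end{align*}
```

Both terms of $g^k$ lie in the range of $\mathbf{X}$, so the step keeps $r^{k+1}$ in $P_{\mathrm{res}}$, and
the bracketed vector is exactly the R-FS$_{\varepsilon,\delta}$ coefficient update.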


4.1 R-FS$_{\varepsilon,\delta}$: Computational Guarantees and their Implications


In this subsection we present computational guarantees and convergence properties of the boosting
algorithm R-FS$_{\varepsilon,\delta}$. Due to the structural equivalence between R-FS$_{\varepsilon,\delta}$ and
subgradient descent applied to the RCM$_\delta$ problem (24) (Proposition 4.1) and the close connection between
RCM$_\delta$ and the Lasso (Appendix A.4.1), the convergence properties of R-FS$_{\varepsilon,\delta}$ are
naturally stated with respect to the Lasso problem (2). Similar to Theorem 3.1 which described such properties
for FS$_\varepsilon$ (with respect to the unregularized least squares problem), we have the following properties
for R-FS$_{\varepsilon,\delta}$.


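Theorem 4.1. (Convergence Properties of R-FS$_{\varepsilon,\delta}$ for the Lasso) Consider the
R-FS$_{\varepsilon,\delta}$ algorithm with learning rate $\varepsilon$ and regularization parameter
$\delta \in (0, \infty)$, where $\varepsilon \le \delta$. Then the regression coefficient $\hat\beta^k$ is
feasible for the Lasso problem (2) for all $k \ge 0$. Let $k \ge 0$ denote a specific iteration counter. Then
there exists an index $i \in \{0, \ldots, k\}$ for which the following bounds hold:

(i) (training error): $L_n(\hat\beta^i) - L_{n,\delta}^* \le \frac{\delta}{n}\left[\frac{\|\mathbf{X}\hat\beta_{LS}\|_2^2}{2\varepsilon(k+1)} + 2\varepsilon\right]$

(ii) (predictions): for every Lasso solution $\hat\beta_\delta^*$ it holds that
$\|\mathbf{X}\hat\beta^i - \mathbf{X}\hat\beta_\delta^*\|_2 \le \sqrt{\frac{\delta\|\mathbf{X}\hat\beta_{LS}\|_2^2}{\varepsilon(k+1)} + 4\delta\varepsilon}$

(iii) ($\ell_1$-shrinkage of coefficients): $\|\hat\beta^i\|_1 \le \delta\left[1 - \left(1 - \frac{\varepsilon}{\delta}\right)^k\right] \le \delta$

(iv) (sparsity of coefficients): $\|\hat\beta^i\|_0 \le k$.

The proof of Theorem 4.1 is presented in Appendix A.4.3. □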


[Figure 7 plot: "R-FS$_{\varepsilon,\delta}$ algorithm, Prostate cancer dataset (computational bounds)"; three panels; horizontal axis: Iterations]

Figure 7: Evolution of the R-FS$_{\varepsilon,\delta}$ algorithm (with $\varepsilon = 10^{-4}$) for different
values of $\delta$, as a function of the number of boosting iterations for the Prostate cancer dataset, with
$n = 10$, $p = 44$, appearing in the bottom panel of Figure 8. [Left panel] shows the change of the
$\ell_1$-norm of the regression coefficients. [Middle panel] shows the evolution of the training errors, and
[Right panel] is a zoomed-in version of the middle panel. Here we took different values of $\delta$ given by
$\delta = \mathrm{frac} \times \delta_{\max}$, where $\delta_{\max}$ denotes the $\ell_1$-norm of the minimum
$\ell_1$-norm least squares solution, for 7 different values of frac.


Interpreting the Computational Guarantees The statistical interpretations implied by the
computational guarantees presented in Theorem 4.1 are analogous to those previously discussed for
LS-Boost($\varepsilon$) (Theorem 2.1) and FS$_\varepsilon$ (Theorem 3.1). These guarantees inform us about the
data-fidelity vis-à-vis shrinkage tradeoffs as a function of the number of boosting iterations, as nicely
demonstrated in Figure 7. There is, however, an important differentiation between the properties
of R-FS$_{\varepsilon,\delta}$ and the properties of LS-Boost($\varepsilon$) and FS$_\varepsilon$, namely:

• For LS-Boost($\varepsilon$) and FS$_\varepsilon$, the computational guarantees (Theorems 2.1 and 3.1) describe
how the estimates make their way to an unregularized ($O(\varepsilon)$-approximate) least squares solution
as a function of the number of boosting iterations.

• For R-FS$_{\varepsilon,\delta}$, our results (Theorem 4.1) characterize how the estimates approach a
($O(\varepsilon)$-approximate) Lasso solution.


Notice that like FS$_\varepsilon$, R-FS$_{\varepsilon,\delta}$ traces out a profile of regression coefficients.
This is reflected in item (iii) of Theorem 4.1 which bounds the $\ell_1$-shrinkage of the coefficients as a
function of the number of boosting iterations $k$. Due to the rescaling of the coefficients, the
$\ell_1$-shrinkage may be bounded by a geometric series that approaches $\delta$ as $k$ grows. Thus, there are
two important aspects of the bound in item (iii): (a) the dependence on the number of boosting iterations $k$
which characterizes model complexity during early iterations, and (b) the uniform bound of $\delta$ which
applies even in the limit as $k \to \infty$ and implies that all regression coefficient iterates $\hat\beta^k$
are feasible for the Lasso problem (2).
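
The geometric-series bound in item (iii) follows from the coefficient update by a short recursion. The
following worked steps are a sketch that uses only the triangle inequality applied to the update and the
initialization $\hat\beta^0 = 0$:

```latex
\begin{align*}
\|\hat\beta^{k+1}\|_1 &\le \big(1 - \tfrac{\varepsilon}{\delta}\big)\,\|\hat\beta^{k}\|_1 + \varepsilon, \\
\|\hat\beta^{k}\|_1 &\le \varepsilon \sum_{j=0}^{k-1} \big(1 - \tfrac{\varepsilon}{\delta}\big)^{j}
  = \delta\Big[1 - \big(1 - \tfrac{\varepsilon}{\delta}\big)^{k}\Big] \le \delta .
\end{align*}
```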


On the other hand, item (i) characterizes the quality of the coefficients with respect to the Lasso
solution, as opposed to the unregularized least squares problem as in FS$_\varepsilon$. In the limit as
$k \to \infty$, item (i) implies that R-FS$_{\varepsilon,\delta}$ identifies a model with training error at most
$L_{n,\delta}^* + \frac{2\delta\varepsilon}{n}$. This upper bound on the training error may be set to any
prescribed error level by appropriately tuning $\varepsilon$; in particular, for $\varepsilon \approx 0$ and
fixed $\delta > 0$ this limit is essentially $L_{n,\delta}^*$. Thus, combined with the uniform bound of $\delta$
on the $\ell_1$-shrinkage, we see that the R-FS$_{\varepsilon,\delta}$ algorithm delivers the Lasso solution in
the limit as $k \to \infty$.
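
As a small numerical illustration of how $\varepsilon$ and $k$ enter this guarantee, the bound in item (i) and
its limit can be evaluated directly. This is a sketch with hypothetical function names and constants, not part
of the paper's experiments.

```python
def r_fs_training_error_bound(k, eps, delta, n, xb_ls_norm):
    """Right-hand side of part (i) of Theorem 4.1 (xb_ls_norm is ||X beta_LS||_2)."""
    return (delta / n) * (xb_ls_norm ** 2 / (2.0 * eps * (k + 1)) + 2.0 * eps)

def r_fs_limiting_gap(eps, delta, n):
    """Limit of the bound as k grows: 2 * delta * eps / n."""
    return 2.0 * delta * eps / n

# Hypothetical constants, used only for illustration.
eps, delta, n, xb_ls_norm = 1e-3, 1.0, 50, 1.0
for k in (10, 10**3, 10**6, 10**9):
    print(k, r_fs_training_error_bound(k, eps, delta, n, xb_ls_norm))
print("limit:", r_fs_limiting_gap(eps, delta, n))
```

The first term decays like $1/(\varepsilon k)$ while the second is proportional to $\varepsilon$, so a smaller
learning rate lowers the limiting gap at the price of more iterations to reach it.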


It is important to emphasize that R-FS$_{\varepsilon,\delta}$ should not just be interpreted as an algorithm to
solve the Lasso. Indeed, like FS$_\varepsilon$, the trajectory of the algorithm is important and
R-FS$_{\varepsilon,\delta}$ may identify a more statistically interesting model in the interior of its profile.
Thus, even if the Lasso solution for $\delta$ leads to overfitting, the R-FS$_{\varepsilon,\delta}$ updates may
visit a model with better predictive performance by trading off bias and variance in a more desirable fashion
suitable for the particular problem at hand.


Figure 8 shows the profiles of R-FS$_{\varepsilon,\delta}$ for different values of
$\delta \le \delta_{\max}$, where $\delta_{\max}$ is the $\ell_1$-norm of the minimum $\ell_1$-norm least squares
solution. Curiously enough, Figure 8 shows that in some cases, the profile of R-FS$_{\varepsilon,\delta}$ bears
a lot of similarities with that of the Lasso (as presented in Figure 2). However, the profiles are in general
different. Indeed, R-FS$_{\varepsilon,\delta}$ imposes a uniform bound of $\delta$ on the $\ell_1$-shrinkage, and
so for values larger than $\delta$ we cannot possibly expect R-FS$_{\varepsilon,\delta}$ to approximate the Lasso
path. However, even if $\delta$ is taken to be sufficiently large (but finite) the profiles may be different. In
this connection it is helpful to draw the analogy between the curious similarities between the FS$_\varepsilon$
(i.e., R-FS$_{\varepsilon,\delta}$ with $\delta = \infty$) and Lasso coefficient profiles, even though the
profiles are different in general.


As a final note, we point out that one can also interpret R-FS$_{\varepsilon,\delta}$ as the Frank-Wolfe
algorithm in convex optimization applied to the Lasso (2), in line with [2]. We refer the reader to Appendix
A.4.5 for discussion of this point.
[Figure 8 plot; panel titles: FS$_\varepsilon$; R-FS$_{\varepsilon,\delta}$, $\delta = 0.99\,\delta_{\max}$; R-FS$_{\varepsilon,\delta}$, $\delta = 0.91\,\delta_{\max}$; R-FS$_{\varepsilon,\delta}$, $\delta = 0.81\,\delta_{\max}$]

Figure 8: Coefficient profiles for R-FS$_{\varepsilon,\delta}$ as a function of the $\ell_1$-norm of the
regression coefficients, for the same datasets appearing in Figure 2. For each example, different values of
$\delta$ have been considered. The left panel corresponds to the choice $\delta = \infty$, i.e.,
FS$_\varepsilon$. In all the above cases, the algorithms were run for a maximum of 100,000 boosting iterations
with $\varepsilon = 10^{-4}$. [Top Panel] Corresponds to the Prostate cancer dataset with $n = 98$ and $p = 8$.
All the coefficient profiles look similar, and they all seem to coincide with the Lasso profile (see also
Figure 2). [Bottom Panel] Shows the Prostate cancer dataset with a subset of samples $n = 10$ with all
interactions included with $p = 44$. The coefficient profiles in this example are sensitive to the choice of
$\delta$ and are seen to be more constrained towards the end of the path, for decreasing $\delta$ values. The
profiles are different than the Lasso profiles, as seen in Figure 2. The regression coefficients at the end of
the path correspond to approximate Lasso solutions, for the respective values of $\delta$.

5 A Modified Forward Stagewise Algorithm for Computing the Lasso Path


In Section 4 we introduced the boosting algorithm R-FS$_{\varepsilon,\delta}$ (which is a very close cousin of
FS$_\varepsilon$) that delivers solutions to the Lasso problem (2) for a fixed but arbitrary $\delta$, in the
limit as $k \to \infty$ with $\varepsilon \approx 0$. Furthermore, our experiments in Section 6 suggest that
R-FS$_{\varepsilon,\delta}$ may lead to estimators with good statistical properties for a wide range of values of
$\delta$, provided that the value of $\delta$ is not too small. While R-FS$_{\varepsilon,\delta}$ by itself may
be considered as a regularization scheme with excellent statistical properties, the boosting profile delivered
by R-FS$_{\varepsilon,\delta}$ might in some cases be different from the Lasso coefficient profile, as we saw in
Figure 8. Therefore in this section we investigate the following question: is it possible to modify the
R-FS$_{\varepsilon,\delta}$ algorithm, while still retaining its basic algorithmic characteristics, so that it
delivers an approximate Lasso coefficient profile for any dataset? We answer this question in the affirmative
herein.


To fix ideas, let us consider producing the (approximate) Lasso path by producing a sequence
of (approximate) Lasso solutions on a predefined grid of regularization parameter values $\delta$ in the
interval $(0, \bar\delta]$ given by $0 < \bar\delta_0 \le \bar\delta_1 \le \ldots \le \bar\delta_K = \bar\delta$.
(A standard method for generating the grid points is to use a geometric sequence such as
$\bar\delta_i = \eta^{-i} \cdot \bar\delta_0$ for $i = 0, \ldots, K$, for some $\eta \in (0,1)$.)
Motivated by the notion of warm-starts popularly used in the statistical computing literature in
the context of computing a path of Lasso solutions (55) via coordinate descent methods [23], we
propose here a slight modification of the R-FS$_{\varepsilon,\delta}$ algorithm that sequentially updates the
value of $\delta$ according to the predefined grid values
$\bar\delta_0, \bar\delta_1, \ldots, \bar\delta_K = \bar\delta$, and does so prior to each update of
$\hat r^i$ and $\hat\beta^i$. We call this method PATH-R-FS$_\varepsilon$, whose complete description is as
follows:


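Algorithm: PATH-R-FS$_\varepsilon$

Fix the learning rate $\varepsilon > 0$, choose values $\bar\delta_i$, $i = 0, \ldots, K$, satisfying
$0 < \bar\delta_0 \le \bar\delta_1 \le \cdots \le \bar\delta_K \le \bar\delta$ such that
$\varepsilon \le \bar\delta_0$.

Initialize at $\hat r^0 = \mathbf{y}$, $\hat\beta^0 = 0$, $k = 0$.

1. For $0 \le k \le K$ do the following:

2. Compute: $j_k \in \arg\max_{j \in \{1,\ldots,p\}} |(\hat r^k)^T \mathbf{X}_j|$

3. Set:

$\hat r^{k+1} \leftarrow \hat r^k - \varepsilon\left[\mathrm{sgn}\big((\hat r^k)^T \mathbf{X}_{j_k}\big)\mathbf{X}_{j_k} + (\hat r^k - \mathbf{y})/\bar\delta_k\right]$

$\hat\beta^{k+1}_{j_k} \leftarrow \left(1 - \varepsilon/\bar\delta_k\right)\hat\beta^{k}_{j_k} + \varepsilon\,\mathrm{sgn}\big((\hat r^k)^T \mathbf{X}_{j_k}\big)$ and $\hat\beta^{k+1}_{j} \leftarrow \left(1 - \varepsilon/\bar\delta_k\right)\hat\beta^{k}_{j}$, $j \ne j_k$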


Notice that PATH-R-FS$_\varepsilon$ retains the identical structure of a forward stagewise regression type of
method, and uses the same essential update structure of Step (3.) of R-FS$_{\varepsilon,\delta}$. Indeed, the
updates of $\hat r^{k+1}$ and $\hat\beta^{k+1}$ in PATH-R-FS$_\varepsilon$ are identical to those in Step (3.) of
R-FS$_{\varepsilon,\delta}$ except that they use the regularization value $\bar\delta_k$ at iteration $k$ instead
of the constant value of $\delta$ as in R-FS$_{\varepsilon,\delta}$.
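
The path variant differs from R-FS$_{\varepsilon,\delta}$ only in that the regularization value changes with
the iteration. The following is a minimal NumPy sketch under that reading of the algorithm box; the names
`geometric_grid` and `path_r_fs` and the demo data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def geometric_grid(delta0, eta, K):
    """Grid delta_i = eta**(-i) * delta0 for i = 0, ..., K, with eta in (0, 1)."""
    return delta0 * eta ** (-np.arange(K + 1, dtype=float))

def path_r_fs(X, y, eps, deltas):
    """Sketch of PATH-R-FS_eps: one R-FS-type step per grid value deltas[k]."""
    deltas = np.asarray(deltas, dtype=float)
    if eps <= 0 or eps > deltas[0] or np.any(np.diff(deltas) < 0):
        raise ValueError("need 0 < eps <= deltas[0] and a nondecreasing grid")
    n, p = X.shape
    beta = np.zeros(p)
    r = np.asarray(y, dtype=float).copy()   # r^0 = y
    path = [beta.copy()]
    for delta_k in deltas:
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))
        s = np.sign(corr[j])
        # Step 3 with the current regularization value delta_k.
        r = r - eps * (s * X[:, j] + (r - y) / delta_k)
        beta = (1.0 - eps / delta_k) * beta
        beta[j] += eps * s
        path.append(beta.copy())
    return np.array(path), r

# Usage on a small synthetic problem with unit-norm columns (assumed setup).
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 8))
X_demo /= np.linalg.norm(X_demo, axis=0)
y_demo = rng.standard_normal(30)
grid_demo = geometric_grid(delta0=0.05, eta=0.99, K=300)
path_demo, r_demo = path_r_fs(X_demo, y_demo, eps=0.01, deltas=grid_demo)
print(grid_demo[-1], np.abs(path_demo[-1]).sum())
```

Each iterate is a warm start for the next grid value, which is why a single pass over the grid suffices.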


Theoretical Guarantees for PATH-R-FS$_\varepsilon$ Analogous to Theorem 4.1 for R-FS$_{\varepsilon,\delta}$, the
following theorem describes properties of the PATH-R-FS$_\varepsilon$ algorithm. In particular, the theorem
provides rigorous guarantees about the distance between the PATH-R-FS$_\varepsilon$ algorithm and the Lasso
coefficient profiles, which apply to any general dataset.


Theorem 5.1. (Computational Guarantees of PATH-R-FS$_\varepsilon$) Consider the PATH-R-FS$_\varepsilon$
algorithm with the given learning rate $\varepsilon$ and regularization parameter sequence $\{\bar\delta_k\}$.
Let $k \ge 0$ denote the total number of iterations. Then the following holds:

(i) (Lasso feasibility and average training error): for each $i = 0, \ldots, k$, $\hat\beta^i$ provides an
approximate solution to the Lasso problem for $\delta = \bar\delta_i$. More specifically, $\hat\beta^i$ is
feasible for the Lasso problem for $\delta = \bar\delta_i$, and satisfies the following suboptimality bound with
respect to the entire boosting profile:

$$\frac{1}{k+1}\sum_{i=0}^{k}\left(L_n(\hat\beta^i) - L_{n,\bar\delta_i}^*\right) \le
\frac{\bar\delta\,\|\mathbf{X}\hat\beta_{LS}\|_2^2}{2n\varepsilon(k+1)} + \frac{2\bar\delta\varepsilon}{n}$$

(ii) ($\ell_1$-shrinkage of coefficients): $\|\hat\beta^i\|_1 \le \bar\delta_i$ for $i = 0, \ldots, k$.

(iii) (sparsity of coefficients): $\|\hat\beta^i\|_0 \le i$ for $i = 0, \ldots, k$.

□


Corollary 5.1. (PATH-R-FS$_\varepsilon$ approximates the Lasso path) For every fixed $\varepsilon > 0$ and
$k \to \infty$ it holds that:

$$\limsup_{k \to \infty} \ \frac{1}{k+1}\sum_{i=0}^{k}\left(L_n(\hat\beta^i) - L_{n,\bar\delta_i}^*\right) \le
\frac{2\bar\delta\varepsilon}{n}$$

(and the quantity on the right side of the above bound goes to zero as $\varepsilon \to 0$). □

The proof of Theorem 5.1 is presented in Appendix A.5.1.


Interpreting the computational guarantees Let us now provide some interpretation of the
results stated in Theorem 5.1. Recall that Theorem 4.1 presented bounds on the distance between
the training errors achieved by the boosting algorithm R-FS$_{\varepsilon,\delta}$ and Lasso training errors for a
fixed but arbitrary $\delta$ that is specified a priori. The message in Theorem 5.1 generalizes this notion to a
family of Lasso solutions corresponding to a grid of $\delta$ values. The theorem thus quantifies how the
boosting algorithm PATH-R-FS$_\varepsilon$ simultaneously approximates a path of Lasso solutions.


Part (i) of Theorem 5.1 first implies that the sequence of regression coefficient vectors $\{\hat\beta^i\}$ is
feasible along the Lasso path, for the Lasso problem (2) for the sequence of regularization parameter values
$\{\bar\delta_i\}$. In considering guarantees with respect to the training error, we would ideally like
guarantees that hold across the entire spectrum of $\{\bar\delta_i\}$ values. While part (i) does not provide
such strong guarantees, part (i) states that these quantities will be sufficiently small on average. Indeed,
for a fixed $\varepsilon$ and as $k \to \infty$, part (i) states that the average of the differences between the
training errors produced by the algorithm and the optimal training errors is at most
$\frac{2\bar\delta\varepsilon}{n}$. This non-vanishing bound (for $\varepsilon > 0$) is a consequence of the fixed
learning rate $\varepsilon$ used in PATH-R-FS$_\varepsilon$; such bounds were also observed for
R-FS$_{\varepsilon,\delta}$ and FS$_\varepsilon$.


Thus on average, the training error of the model $\hat\beta^i$ will be sufficiently close (as controlled by the
learning rate $\varepsilon$) to the optimal training error for the corresponding regularization parameter value
$\bar\delta_i$. In summary, while PATH-R-FS$_\varepsilon$ provides the greatest flexibility in terms of
controlling for model complexity since it allows for any (monotone) sequence of regularization parameter values
in the range $(0, \bar\delta]$, this freedom comes at the cost of weaker training error guarantees with respect
to any particular $\bar\delta_i$ value (as opposed to R-FS$_{\varepsilon,\delta}$ which provides strong
guarantees with respect to the fixed value $\delta$). Nevertheless, part (i) guarantees that the training errors
will be sufficiently small on average across the entire path of regularization parameter values explored by the
algorithm.












6 Some Computational Experiments 


We consider an array of examples exploring statistical properties of the different boosting algorithms 
studied herein. We consider different types of synthetic and real datasets, which are briefly described 
here. 


Synthetic datasets We considered synthetically generated datasets of the following types:

• Eg-A. Here the data matrix $\mathbf{X}$ is generated from a multivariate normal distribution, i.e., for
each $i = 1, \ldots, n$, $x_i \sim \mathrm{MVN}(0, \Sigma)$. Here $x_i$ denotes the $i$th row of $\mathbf{X}$ and
$\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ has all off-diagonal entries equal to $\rho$ and all diagonal
entries equal to one. The response $\mathbf{y} \in \mathbb{R}^n$ is generated as
$\mathbf{y} = \mathbf{X}\beta^{\mathrm{pop}} + \epsilon$, where $\epsilon_i \sim N(0, \sigma^2)$. The underlying
regression coefficient was taken to be sparse with $\beta_i^{\mathrm{pop}} = 1$ for all $i \le 5$ and
$\beta_i^{\mathrm{pop}} = 0$ otherwise. $\sigma^2$ is chosen so as to control the signal to noise ratio
$\mathrm{SNR} := \mathrm{Var}(x^T\beta)/\sigma^2$.

Different values of SNR, $n$, $p$ and $\rho$ were taken and they have been specified in our results
when and where appropriate.

• Eg-B. Here the datasets are generated similar to above, with $\beta_i^{\mathrm{pop}} = 1$ for $i \le 10$ and
$\beta_i^{\mathrm{pop}} = 0$ otherwise. We took the value of SNR = 1 in this example.
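
The Eg-A recipe above can be sketched as a small generator. This is an illustration, not the code used for the
reported experiments: the function name `make_eg_a` and its arguments are hypothetical, and the noise level is
set from the population variance $\beta^T \Sigma \beta$ of the signal, which is one assumed way of fixing the
SNR. Setting `n_nonzero=10` and `snr=1.0` gives data of the Eg-B type.

```python
import numpy as np

def make_eg_a(n, p, rho, snr, n_nonzero=5, seed=0):
    """Generate (X, y, beta_pop, sigma) following the Eg-A description."""
    rng = np.random.default_rng(seed)
    # Equicorrelated covariance: ones on the diagonal, rho elsewhere.
    Sigma = np.full((p, p), rho, dtype=float)
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta_pop = np.zeros(p)
    beta_pop[:n_nonzero] = 1.0
    # Assumption: Var(x^T beta) is taken as the population value beta^T Sigma beta.
    signal_var = beta_pop @ Sigma @ beta_pop
    sigma = np.sqrt(signal_var / snr)
    y = X @ beta_pop + sigma * rng.standard_normal(n)
    return X, y, beta_pop, sigma

X_demo, y_demo, beta_demo, sigma_demo = make_eg_a(n=40, p=12, rho=0.5, snr=1.0)
print(X_demo.shape, y_demo.shape, sigma_demo ** 2)
```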


Real datasets We considered four different publicly available microarray datasets as described
below.

• Leukemia dataset. This dataset, taken from [12], was processed to have $n = 72$ and
$p = 500$. $\mathbf{y}$ was created as $\mathbf{y} = \mathbf{X}\beta^{\mathrm{pop}} + \epsilon$; with
$\beta_i^{\mathrm{pop}} = 1$ for all $i \le 10$ and zero otherwise.

• Golub dataset. This dataset, taken from the R package mpm, was processed to have $n = 73$
and $p = 500$, with artificial responses generated as above.

• Khan dataset. This dataset, taken from the website of [28], was processed to have $n = 73$
and $p = 500$, with artificial responses generated as above.

• Prostate dataset. This dataset, analyzed in [15], was processed to create three types of
different datasets: (a) the original dataset with $n = 97$ and $p = 8$, (b) a dataset with $n = 97$
and $p = 44$, formed by extending the covariate space to include second order interactions,
and (c) a third dataset with $n = 10$ and $p = 44$, formed by subsampling the previous dataset.

For more detail on the above datasets, we refer the reader to Appendix B.


Note that in all the examples we standardized $\mathbf{X}$ such that the columns have unit $\ell_2$ norm, before
running the different algorithms studied herein.


6.1 Statistical properties of boosting algorithms: an empirical study 

We performed some experiments to better understand the statistical behavior of the different boosting
methods described in this paper. We summarize our findings here; for details (including tables,
figures and discussions) we refer the reader to Appendix, Section B.

Sensitivity of the Learning Rate in LS-Boost($\varepsilon$) and FS$_\varepsilon$ We explored how the training
and test errors for LS-Boost($\varepsilon$) and FS$_\varepsilon$ change as a function of the number of boosting
iterations and the learning rate. We observed that the best predictive models were sensitive to the choice of
$\varepsilon$; the best models were obtained at values larger than zero and smaller than one. When compared
to Lasso, stepwise regression [15] and FS$_0$ [15], FS$_\varepsilon$ and LS-Boost($\varepsilon$) were found to be
as good as the others, and in some cases better than the rest.


Statistical properties of R-FS$_{\varepsilon,\delta}$, Lasso and FS$_\varepsilon$: an empirical study We performed
some experiments to evaluate the performance of R-FS$_{\varepsilon,\delta}$, in terms of predictive accuracy and
sparsity of the optimal model, versus the more widely known methods FS$_\varepsilon$ and Lasso. We found that
when $\delta$ was larger than the best $\delta$ for the Lasso (in terms of obtaining a model with the best
predictive performance), R-FS$_{\varepsilon,\delta}$ delivered a model with excellent statistical properties:
R-FS$_{\varepsilon,\delta}$ led to sparse solutions and the predictive performance was as good as, and in some
cases better than, the Lasso solution. We observed that the choice of $\delta$ does not play a very crucial role
in the R-FS$_{\varepsilon,\delta}$ algorithm, once it is chosen to be reasonably large; indeed the number of
boosting iterations plays a more important role. The best models delivered by R-FS$_{\varepsilon,\delta}$ were
more sparse than those of FS$_\varepsilon$.


Acknowledgements 


The authors would like to thank Alexandre Belloni, Jerome Friedman, Trevor Hastie, Arian Maleki
and Tomaso Poggio for helpful discussions and encouragement. A preliminary unpublished version
of some of the results herein was posted on the ArXiv [18].


References 

[1] M. Avriel. Nonlinear Programming Analysis and Methods. Prentice-Hall, Englewood Cliffs, N.J., 1976. 

[2] F. Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization,
25(1):115-129, Jan. 2015.

[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999. 

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004. 

[5] L. Breiman. Arcing classifiers (with discussion). Annals of Statistics, 26:801-849, 1998. 

[6] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

[7] P. Bühlmann. Boosting for high-dimensional linear models. The Annals of Statistics, pages 559-583,
2006.

[8] P. Bühlmann and T. Hothorn. Boosting algorithms: regularization, prediction and model fitting (with
discussion). Statistical Science, 22(4):477-505, 2008.

[9] P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American
Statistical Association, 98(462):324-339, 2003.

[10] P. Bühlmann and B. Yu. Sparse boosting. The Journal of Machine Learning Research, 7:1001-1024,
2006.







[11] K. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. 19th ACM-SIAM 
Symposium on Discrete Algorithms, pages 922-931, 2008. 

[12] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression data.
Bioinformatics, 19(9):1061-1069, 2003.

[13] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia?
Journal of the Royal Statistical Society. Series B (Methodological), pages 301-369, 1995.

[14] J. Duchi and Y. Singer. Boosting with structural sparsity. In Proceedings of the 26th Annual
International Conference on Machine Learning, pages 297-304. ACM, 2009.

[15] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Annals 
of Statistics, 32(2):407-499, 2004. 

[16] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 
3:95-110, 1956. 

[17] R. M. Freund and P. Grigas. New analysis and results for the Frank-Wolfe method, to appear in 
Mathematical Programming, 2014. 

[18] R. M. Freund, P. Grigas, and R. Mazumder. Adaboost and forward stagewise regression are first-order 
convex optimization methods. CoRR, abs/1307.1192, 2013. 

[19] Y. Freund. Boosting a weak learning algorithm by majority. Information and computation, 121(2):256- 
285, 1995. 

[20] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Machine Learning:
Proceedings of the Thirteenth International Conference, pages 148-156. Morgan Kaufmann, San Francisco,
1996.

[21] J. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 
29(5):1189-1232, 2001. 

[22] J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics, 
Stanford University, 2008. 

[23] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of
Applied Statistics, 2(1):302-332, 2007.

[24] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting
(with discussion). Annals of Statistics, 28:337-407, 2000.

[25] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 
29:1189-1232, 2000. 

[26] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles. Journal of Machine 
Learning Research, 94305, 2003. 

[27] T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone 
lasso. Electronic Journal of Statistics, 1:1-29, 2007. 

[28] T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning: Data Mining, Inference, 
and Prediction. Springer Verlag, New York, 2009. 

[29] T. J. Hastie and R. J. Tibshirani. Generalized additive models, volume 43. CRC Press, 1990. 

[30] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 
30th International Conference on Machine Learning (ICML-13), pages 427-435, 2013. 

[31] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing,
IEEE Transactions on, 41(12):3397-3415, 1993.





[32] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. 12:512-518,
2000.

[33] A. Miller. Subset selection in regression. CRC Press Washington, 2002. 

[34] Y. E. Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied 
Optimization. Kluwer Academic Publishers, Boston, 2003. 

[35] B. Polyak. Introduction to Optimization. Optimization Software, Inc., New York, 1987. 

[36] G. Ratsch, T. Onoda, and K.-R. Muller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.

[37] S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. 
Journal of Machine Learning Research, 5:941-973, 2004. 

[38] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990. 

[39] R. Schapire and Y. Freund. Boosting: Foundations and Algorithms. Adaptive computation and machine learning. MIT Press, 2012.

[40] N. Z. Shor. Minimization Methods for Non-Differentiable Functions, volume 3 of Springer Series in 
Computational Mathematics. Springer, Berlin, 1985. 

[41] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[42] J. Tukey. Exploratory data analysis. Addison-Wesley, Massachusetts, 1977.

[43] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[44] S. Weisberg. Applied Linear Regression. Wiley, New York, 1980. 

[45] P. Zhao and B. Yu. Stagewise lasso. The Journal of Machine Learning Research, 8:2701-2726, 2007. 





A Technical Details and Supplementary Material 

A.1 Additional Details for Section 1

A.1.1 Figure showing Training error versus $\ell_1$-shrinkage bounds

Figure 9 shows profiles of the $\ell_1$ norm of the regression coefficients versus training error for LS-Boost($\varepsilon$), FS$_\varepsilon$ and Lasso.

[Figure 9 (image not reproduced). Title: "$\ell_1$ shrinkage versus data-fidelity tradeoffs: LS-Boost($\varepsilon$), FS$_\varepsilon$, and Lasso". Three panels; horizontal axes: $\ell_1$ shrinkage of coefficients; vertical axes: training error.]

Figure 9: Figure showing profiles of the $\ell_1$ norm of the regression coefficients versus training error for LS-Boost($\varepsilon$), FS$_\varepsilon$ and Lasso. [Left panel] Shows profiles for a synthetic dataset where the covariates are drawn from a Gaussian distribution with pairwise correlations $\rho = 0.5$. The true $\beta$ has ten non-zeros with $\beta_i = 1$ for $i = 1, \ldots, 10$, and SNR = 1. Here we ran LS-Boost($\varepsilon$) with $\varepsilon = 1$ and ran FS$_\varepsilon$ with $\varepsilon = 10^{-2}$. The middle (and right) panel profiles correspond to the Prostate cancer dataset (described in Section 6). Here we ran LS-Boost($\varepsilon$) with $\varepsilon = 0.01$ and we ran FS$_\varepsilon$ with $\varepsilon = 10^{-5}$. The right panel figure is a zoomed-in version of the middle panel in order to highlight the difference in profiles between LS-Boost($\varepsilon$), FS$_\varepsilon$ and Lasso. The vertical axes have been normalized so that the training error at $k = 0$ is one, and the horizontal axes have been scaled to the unit interval (to express the $\ell_1$-norm of $\hat\beta^k$ as a fraction of the maximum).


A.2 Additional Details for Section 2 

A.2.1 Properties of Convex Quadratic Functions 

Consider the following quadratic optimization problem (QP) defined as:

$$h^* := \min_{x \in \mathbb{R}^n} \; h(x) := \tfrac{1}{2} x^T Q x + q^T x + q^0 ,$$

where $Q$ is a symmetric positive semi-definite matrix, whereby $h(\cdot)$ is a convex function. We assume that $Q \neq 0$, and recall that $\lambda_{\mathrm{pmin}}(Q)$ denotes the smallest nonzero (and hence positive) eigenvalue of $Q$.

Proposition A.1. If $h^* > -\infty$, then for any given $x$, there exists an optimal solution $x^*$ of (QP) for which

$$\|x - x^*\|_2 \le \sqrt{\frac{2\,(h(x) - h^*)}{\lambda_{\mathrm{pmin}}(Q)}} .$$

Also, it holds that

$$\|\nabla h(x)\|_2 \ge \sqrt{\frac{\lambda_{\mathrm{pmin}}(Q) \cdot (h(x) - h^*)}{2}} .$$


Proof: The result is simply manipulation of linear algebra. Let us assume without loss of generality that $q^0 = 0$. If $h^* > -\infty$, then (QP) has an optimal solution $x^*$, and the set of optimal solutions is characterized by the gradient condition

$$0 = \nabla h(x) = Qx + q .$$

Now let us write the sparse eigendecomposition of $Q$ as $Q = PDP^T$ where $D$ is a diagonal matrix of non-zero eigenvalues of $Q$ and the columns of $P$ are orthonormal, namely $P^TP = I$. Because (QP) has an optimal solution, the system of equations $Qx = -q$ has a solution, and let $\tilde x$ denote any such solution. Direct manipulation establishes:

$$PP^Tq = -PP^TQ\tilde x = -PP^TPDP^T\tilde x = -PDP^T\tilde x = -Q\tilde x = q .$$

Furthermore, let $\hat x := -PD^{-1}P^Tq$. It is then straightforward to see that $\hat x$ is an optimal solution of (QP) since in particular:

$$Q\hat x = -PDP^TPD^{-1}P^Tq = -PP^Tq = -q ,$$

and hence

$$h^* = \tfrac12 \hat x^TQ\hat x + q^T\hat x = -\tfrac12 \hat x^TQ\hat x = -\tfrac12 q^TPD^{-1}P^TPDP^TPD^{-1}P^Tq = -\tfrac12 q^TPD^{-1}P^Tq .$$

Now let $x$ be given, and define $x^* := [I - PP^T]x - PD^{-1}P^Tq$. Then just as above it is straightforward to establish that $Qx^* = -q$, whereby $x^*$ is an optimal solution. Furthermore, it holds that:

$$\begin{aligned}
\|x - x^*\|_2^2 &= (q^TPD^{-1} + x^TP)\,P^TP\,(D^{-1}P^Tq + P^Tx) \\
&= (q^TPD^{-1/2} + x^TPD^{1/2})\,D^{-1}\,(D^{-1/2}P^Tq + D^{1/2}P^Tx) \\
&\le \tfrac{1}{\lambda_{\mathrm{pmin}}(Q)}\,(q^TPD^{-1/2} + x^TPD^{1/2})(D^{-1/2}P^Tq + D^{1/2}P^Tx) \\
&= \tfrac{1}{\lambda_{\mathrm{pmin}}(Q)}\,(q^TPD^{-1}P^Tq + x^TPDP^Tx + 2x^TPP^Tq) \\
&= \tfrac{1}{\lambda_{\mathrm{pmin}}(Q)}\,(-2h^* + x^TQx + 2x^Tq) \\
&= \tfrac{2}{\lambda_{\mathrm{pmin}}(Q)}\,(h(x) - h^*) ,
\end{aligned}$$

and taking square roots establishes the first inequality of the proposition.

Using the gradient inequality for convex functions, it holds that:

$$\begin{aligned}
h^* = h(x^*) &\ge h(x) + \nabla h(x)^T(x^* - x) \\
&\ge h(x) - \|\nabla h(x)\|_2\,\|x^* - x\|_2 \\
&\ge h(x) - \|\nabla h(x)\|_2 \sqrt{\frac{2\,(h(x) - h^*)}{\lambda_{\mathrm{pmin}}(Q)}} ,
\end{aligned}$$

and rearranging the above proves the second inequality of the proposition.

□
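As a quick sanity check of Proposition A.1, the following minimal sketch evaluates both inequalities for a diagonal $Q$ with one zero eigenvalue, where $P$, $D$ and $x^*$ can be written down by hand. The function name `check_quadratic_bounds` and the specific numbers are illustrative assumptions and are not part of the paper.

```python
import math

def check_quadratic_bounds(diag_q, q, x):
    """Evaluate both sides of the two inequalities of Proposition A.1 for
    h(x) = 0.5 * x^T Q x + q^T x with Q = diag(diag_q) (illustrative only)."""
    # h is bounded below only if q vanishes on the null space of Q.
    if any(d == 0 and qi != 0 for d, qi in zip(diag_q, q)):
        raise ValueError("q must be zero wherever the diagonal of Q is zero")
    lam_pmin = min(d for d in diag_q if d > 0)  # smallest nonzero eigenvalue

    def h(z):
        return sum(0.5 * d * zi * zi + qi * zi for d, qi, zi in zip(diag_q, q, z))

    # x* = (I - P P^T) x - P D^{-1} P^T q: keep the null-space coordinates of x
    # and set every other coordinate to its minimizer -q_i / d_i.
    x_star = [(-qi / d) if d > 0 else xi for d, qi, xi in zip(diag_q, q, x)]
    gap = h(x) - h(x_star)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_star)))
    grad_norm = math.sqrt(sum((d * xi + qi) ** 2 for d, qi, xi in zip(diag_q, q, x)))
    return {
        "gap": gap,
        "dist": dist,
        "dist_bound": math.sqrt(2.0 * gap / lam_pmin),
        "grad_norm": grad_norm,
        "grad_bound": math.sqrt(lam_pmin * gap / 2.0),
    }

result = check_quadratic_bounds([4.0, 1.0, 0.0], [2.0, -3.0, 0.0], [1.5, -2.0, 7.0])
```

In this example the distance to $x^*$ is below `dist_bound` and the gradient norm is above `grad_bound`, as the proposition requires; the zero eigenvalue plays no role in either bound because $x^*$ copies the null-space coordinate of $x$.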


A.2.2 Proof of Theorem 2.1


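We first prove part (i). Utilizing (12), which states that $\hat r^{k+1} = \hat r^k - \varepsilon\,((\hat r^k)^TX_{j_k})\,X_{j_k}$, we have:

$$\begin{aligned}
L_n(\hat\beta^{k+1}) &= \tfrac{1}{2n}\|\hat r^{k+1}\|_2^2 \\
&= \tfrac{1}{2n}\|\hat r^k - \varepsilon\,((\hat r^k)^TX_{j_k})\,X_{j_k}\|_2^2 \\
&= \tfrac{1}{2n}\|\hat r^k\|_2^2 - \tfrac{1}{n}\varepsilon\,((\hat r^k)^TX_{j_k})^2 + \tfrac{1}{2n}\varepsilon^2\,((\hat r^k)^TX_{j_k})^2 \\
&= L_n(\hat\beta^k) - \tfrac{1}{2n}\varepsilon(2 - \varepsilon)\,((\hat r^k)^TX_{j_k})^2 \\
&= L_n(\hat\beta^k) - \tfrac{1}{2n}\varepsilon(2 - \varepsilon)\,n^2\,\|\nabla L_n(\hat\beta^k)\|_\infty^2
\end{aligned} \tag{25}$$

(where the third equality uses $\|X_{j_k}\|_2 = 1$ and the last equality uses the identity $n\,\|\nabla L_n(\hat\beta^k)\|_\infty = \|X^T\hat r^k\|_\infty = |(\hat r^k)^TX_{j_k}|$), which yields:

$$L_n(\hat\beta^{k+1}) - L_n^* = L_n(\hat\beta^k) - L_n^* - \tfrac{n}{2}\varepsilon(2 - \varepsilon)\,\|\nabla L_n(\hat\beta^k)\|_\infty^2 . \tag{26}$$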


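We next seek to bound the right-most term above. We will do this by invoking Proposition A.1, which presents two important properties of convex quadratic functions. Because $L_n(\cdot)$ is a convex quadratic function of the same format as Proposition A.1 with $h(\cdot) \leftarrow L_n(\cdot)$, $Q \leftarrow \tfrac1n X^TX$, and $h^* \leftarrow L_n^*$, it follows from the second property of Proposition A.1 that

$$\|\nabla L_n(\beta)\|_2 \ge \sqrt{\frac{\lambda_{\mathrm{pmin}}(\tfrac1n X^TX)\,(L_n(\beta) - L_n^*)}{2}} = \sqrt{\frac{\lambda_{\mathrm{pmin}}(X^TX)\,(L_n(\beta) - L_n^*)}{2n}} .$$

Therefore

$$\|\nabla L_n(\beta)\|_\infty^2 \ge \tfrac1p\,\|\nabla L_n(\beta)\|_2^2 \ge \frac{\lambda_{\mathrm{pmin}}(X^TX)\,(L_n(\beta) - L_n^*)}{2np} .$$

Substituting this inequality into (26) yields after rearranging:

$$L_n(\hat\beta^{k+1}) - L_n^* \le (L_n(\hat\beta^k) - L_n^*)\left(1 - \frac{\varepsilon(2 - \varepsilon)\,\lambda_{\mathrm{pmin}}(X^TX)}{4p}\right) = (L_n(\hat\beta^k) - L_n^*) \cdot \gamma . \tag{27}$$

Now note that $L_n(\hat\beta^0) = L_n(0) = \tfrac{1}{2n}\|y\|_2^2$ and

$$L_n(\hat\beta^0) - L_n^* = \tfrac{1}{2n}\|y\|_2^2 - \tfrac{1}{2n}\|y - X\hat\beta_{LS}\|_2^2 = \tfrac{1}{2n}\|y\|_2^2 - \tfrac{1}{2n}\left(\|y\|_2^2 - 2y^TX\hat\beta_{LS} + \|X\hat\beta_{LS}\|_2^2\right) = \tfrac{1}{2n}\|X\hat\beta_{LS}\|_2^2 ,$$

where the last equality uses the normal equations (6). Then (i) follows by using elementary induction and combining the above with (27):

$$L_n(\hat\beta^k) - L_n^* \le (L_n(\hat\beta^0) - L_n^*) \cdot \gamma^k = \tfrac{1}{2n}\|X\hat\beta_{LS}\|_2^2 \cdot \gamma^k .$$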


To prove (ii), we invoke the first inequality of Proposition A.1, which in this context states that

$$\|\hat\beta^k - \hat\beta^k_{LS}\|_2 \le \sqrt{\frac{2\,(L_n(\hat\beta^k) - L_n^*)}{\lambda_{\mathrm{pmin}}(\tfrac1n X^TX)}} = \frac{\sqrt{2n\,(L_n(\hat\beta^k) - L_n^*)}}{\sqrt{\lambda_{\mathrm{pmin}}(X^TX)}} .$$

Part (ii) then follows by substituting the bound on $(L_n(\hat\beta^k) - L_n^*)$ from (i) and simplifying terms.

Similarly, the proof of (iii) follows from the observation that $\|X\hat\beta^k - X\hat\beta_{LS}\|_2 = \sqrt{2n\,(L_n(\hat\beta^k) - L_n^*)}$ and then substituting the bound on $(L_n(\hat\beta^k) - L_n^*)$ from (i) and simplifying terms.

To prove (iv), define the point $\tilde\beta^k := \hat\beta^k + \tilde u_{j_k} e_{j_k}$. Then using similar arithmetic as in (25) one obtains:

$$L_n^* \le L_n(\tilde\beta^k) = L_n(\hat\beta^k) - \tfrac{1}{2n}\tilde u_{j_k}^2 ,$$

where we recall that $\tilde u_{j_k} = (\hat r^k)^TX_{j_k}$. This inequality then rearranges to

$$|\tilde u_{j_k}| \le \sqrt{2n\,(L_n(\hat\beta^k) - L_n^*)} \le \|X\hat\beta_{LS}\|_2 \cdot \gamma^{k/2} , \tag{28}$$

where the second inequality follows by substituting the bound on $(L_n(\hat\beta^k) - L_n^*)$ from (i). Recalling that $\|\nabla L_n(\hat\beta^k)\|_\infty = \tfrac1n\|X^T\hat r^k\|_\infty = \tfrac1n|\tilde u_{j_k}|$, the above is exactly part (iv).

Part (v) presents two distinct bounds on $\|\hat\beta^k\|_1$, which we prove independently. To prove the first bound, let $\hat\beta_{LS}$ be any least-squares solution, which therefore satisfies the normal equations (6). It is then elementary to derive using similar manipulation as in (25) that for all $i$ the following holds:

$$\|X(\hat\beta^{i+1} - \hat\beta_{LS})\|_2^2 = \|X(\hat\beta^i - \hat\beta_{LS})\|_2^2 - (2\varepsilon - \varepsilon^2)\,\tilde u_{j_i}^2 , \tag{29}$$

which implies that

$$(2\varepsilon - \varepsilon^2)\sum_{i=0}^{k-1}\tilde u_{j_i}^2 = \|X(\hat\beta^0 - \hat\beta_{LS})\|_2^2 - \|X(\hat\beta^k - \hat\beta_{LS})\|_2^2 = \|X\hat\beta_{LS}\|_2^2 - \|X(\hat\beta^k - \hat\beta_{LS})\|_2^2 . \tag{30}$$

Then note that

$$\|\hat\beta^k\|_1 \le \|(\varepsilon\tilde u_{j_0}, \ldots, \varepsilon\tilde u_{j_{k-1}})\|_1 \le \sqrt{k}\,\varepsilon\,\|(\tilde u_{j_0}, \ldots, \tilde u_{j_{k-1}})\|_2 = \sqrt{k}\,\sqrt{\frac{\varepsilon}{2 - \varepsilon}}\,\sqrt{\|X\hat\beta_{LS}\|_2^2 - \|X\hat\beta_{LS} - X\hat\beta^k\|_2^2} ,$$

where the last equality is from (30).

To prove the second bound in (v), noting that $\hat\beta^k = \sum_{i=0}^{k-1}\varepsilon\,\tilde u_{j_i} e_{j_i}$, we bound $\|\hat\beta^k\|_1$ as follows:

$$\|\hat\beta^k\|_1 \le \varepsilon\sum_{i=0}^{k-1}|\tilde u_{j_i}| \le \varepsilon\,\|X\hat\beta_{LS}\|_2\sum_{i=0}^{k-1}\gamma^{i/2} = \frac{\varepsilon\,\|X\hat\beta_{LS}\|_2}{1 - \sqrt{\gamma}}\left(1 - \gamma^{k/2}\right) ,$$

where the second inequality uses (28) for each $i \in \{0, \ldots, k-1\}$ and the final equality is a geometric series, which completes the proof of (v). Part (vi) is simply the property of LS-Boost($\varepsilon$) that derives from the fact that $\hat\beta^0 := 0$ and at every iteration at most one coordinate of $\hat\beta$ changes status from a zero to a non-zero value.

□
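The per-iteration identity (25) that drives the proof can be checked numerically. The sketch below is a minimal pure-Python rendering of the LS-Boost($\varepsilon$) update used above, run on a small hand-made design with unit-norm columns; the function name `ls_boost`, the toy data, and the choice $\varepsilon = 0.5$ are assumptions made only for illustration.

```python
def ls_boost(X, y, eps, num_iters):
    """Illustrative LS-Boost(eps): X is a list of rows with unit-norm columns.
    Returns coefficients, the loss L_n at each iterate, and the selected
    correlations u_{j_k} = (r^k)^T X_{j_k}."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # residuals r^0 = y since beta^0 = 0
    losses = [sum(v * v for v in r) / (2.0 * n)]
    corrs = []
    for _ in range(num_iters):
        # correlations between the residuals and each column of X
        u = [sum(r[i] * X[i][j] for i in range(n)) for j in range(p)]
        jk = max(range(p), key=lambda j: abs(u[j]))
        beta[jk] += eps * u[jk]
        r = [r[i] - eps * u[jk] * X[i][jk] for i in range(n)]
        losses.append(sum(v * v for v in r) / (2.0 * n))
        corrs.append(u[jk])
    return beta, losses, corrs

# Toy design (assumption): both columns have unit Euclidean norm.
X = [[1.0, 0.6], [0.0, 0.8], [0.0, 0.0]]
y = [1.0, 2.0, 0.5]
eps = 0.5
beta, losses, corrs = ls_boost(X, y, eps, num_iters=40)
```

Each recorded loss should satisfy `losses[k+1] == losses[k] - eps*(2-eps)*corrs[k]**2/(2*n)` up to rounding, which is (25) written in terms of the selected correlation. The losses also stay above the least-squares value, which for this toy data is $0.5^2/(2 \cdot 3)$ because only the third coordinate of $y$ lies outside the column space.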


A.2.3 Additional properties of LS-Boost($\varepsilon$)

We present two other interesting properties of the LS-Boost($\varepsilon$) algorithm, namely an additional bound on the correlation between residuals and predictors, and a bound on the $\ell_2$-shrinkage of the regression coefficients. Both are presented in the following proposition.

Proposition A.2. (Two additional properties of LS-Boost($\varepsilon$)) Consider the iterates of the LS-Boost($\varepsilon$) algorithm after $k$ iterations and consider the linear convergence rate coefficient $\gamma$:

$$\gamma := \left(1 - \frac{\varepsilon(2 - \varepsilon)\,\lambda_{\mathrm{pmin}}(X^TX)}{4p}\right) .$$

(i) There exists an index $i \in \{0, \ldots, k\}$ for which the $\ell_\infty$ norm of the gradient vector of the least squares loss function evaluated at $\hat\beta^i$ satisfies:

$$\|\nabla L_n(\hat\beta^i)\|_\infty = \tfrac1n\|X^T\hat r^i\|_\infty \le \min\left\{ \frac{\sqrt{\|X\hat\beta_{LS}\|_2^2 - \|X\hat\beta_{LS} - X\hat\beta^{k+1}\|_2^2}}{n\sqrt{\varepsilon(2 - \varepsilon)(k + 1)}} \; , \; \tfrac1n\|X\hat\beta_{LS}\|_2 \cdot \gamma^{k/2} \right\} . \tag{31}$$

(ii) Let $J_\ell$ denote the number of iterations of LS-Boost($\varepsilon$), among the first $k$ iterations, where the algorithm takes a step in coordinate $\ell$, for $\ell = 1, \ldots, p$, and let $J_{\max} := \max\{J_1, \ldots, J_p\}$. Then the following bound on the shrinkage of $\hat\beta^k$ holds:

$$\|\hat\beta^k\|_2 \le \sqrt{J_{\max}}\,\sqrt{\frac{\varepsilon}{2 - \varepsilon}}\,\sqrt{\|X\hat\beta_{LS}\|_2^2 - \|X\hat\beta_{LS} - X\hat\beta^k\|_2^2} . \tag{32}$$

□


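Proof. We first prove part (i). The first equality of (31) is a restatement of the expression for the gradient of the least squares loss. For each $i \in \{0, \ldots, k\}$, recall that $\tilde u_{j_i} = (\hat r^i)^TX_{j_i}$ and that $|\tilde u_{j_i}| = |(\hat r^i)^TX_{j_i}| = \|X^T\hat r^i\|_\infty$ from the choice of the index $j_i$. Therefore:

$$\left(\min_{i \in \{0, \ldots, k\}} |\tilde u_{j_i}|\right)^2 = \min_{i \in \{0, \ldots, k\}} \tilde u_{j_i}^2 \le \frac{1}{k + 1}\sum_{i=0}^{k}\tilde u_{j_i}^2 = \frac{\|X\hat\beta_{LS}\|_2^2 - \|X(\hat\beta^{k+1} - \hat\beta_{LS})\|_2^2}{\varepsilon(2 - \varepsilon)(k + 1)} , \tag{33}$$

where the final step follows from (30) in the proof of Theorem 2.1. Now letting $i$ be an index achieving the minimum in the left hand side of the above and taking square roots implies that

$$\|X^T\hat r^i\|_\infty = |\tilde u_{j_i}| \le \frac{\sqrt{\|X\hat\beta_{LS}\|_2^2 - \|X\hat\beta_{LS} - X\hat\beta^{k+1}\|_2^2}}{\sqrt{\varepsilon(2 - \varepsilon)(k + 1)}} ,$$

which is equivalent to the inequality in (31) for the first right-most term therein. Directly applying (28) from the proof of Theorem 2.1 and using the fact that $i$ is an index achieving the minimum in the left hand side of (33) yields:

$$\|X^T\hat r^i\|_\infty = |\tilde u_{j_i}| \le |\tilde u_{j_k}| \le \|X\hat\beta_{LS}\|_2\left(1 - \frac{\varepsilon(2 - \varepsilon)\,\lambda_{\mathrm{pmin}}(X^TX)}{4p}\right)^{k/2} ,$$

which is equivalent to the inequality in (31) for the second right-most term therein.

We now prove part (ii). For fixed $k > 0$, let $\mathcal{J}(\ell)$ denote the set of iteration counters where LS-Boost($\varepsilon$) modifies coordinate $\ell$ of $\hat\beta$, namely

$$\mathcal{J}(\ell) := \{i : i < k \text{ and } j_i = \ell \text{ in Step (2.) of Algorithm LS-Boost}(\varepsilon)\} ,$$

for $\ell = 1, \ldots, p$. Then $J_\ell = |\mathcal{J}(\ell)|$ and the sets $\mathcal{J}(1), \ldots, \mathcal{J}(p)$ partition the iteration index set $\{0, 1, \ldots, k - 1\}$. We have:

$$\begin{aligned}
\|\hat\beta^k\|_2 &\le \left\|\left(\textstyle\sum_{i \in \mathcal{J}(1)}\varepsilon\tilde u_{j_i}, \ldots, \sum_{i \in \mathcal{J}(p)}\varepsilon\tilde u_{j_i}\right)\right\|_2 \\
&\le \varepsilon\left\|\left(\sqrt{J_1}\sqrt{\textstyle\sum_{i \in \mathcal{J}(1)}\tilde u_{j_i}^2}, \ldots, \sqrt{J_p}\sqrt{\textstyle\sum_{i \in \mathcal{J}(p)}\tilde u_{j_i}^2}\right)\right\|_2 \\
&\le \varepsilon\sqrt{J_{\max}}\left\|\left(\sqrt{\textstyle\sum_{i \in \mathcal{J}(1)}\tilde u_{j_i}^2}, \ldots, \sqrt{\textstyle\sum_{i \in \mathcal{J}(p)}\tilde u_{j_i}^2}\right)\right\|_2 \\
&= \varepsilon\sqrt{J_{\max}}\sqrt{\tilde u_{j_0}^2 + \cdots + \tilde u_{j_{k-1}}^2} ,
\end{aligned} \tag{34}$$

and the proof is completed by applying inequality (30).

□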


Part (i) of Proposition A.2 describes the behavior of the gradient of the least squares loss function. Indeed, recall that the dynamics of the gradient are closely linked to that of the LS-Boost($\varepsilon$) algorithm and, in particular, to the evolution of the loss function values. To illustrate this connection, let us recall two simple characteristics of the LS-Boost($\varepsilon$) algorithm:

$$L_n(\hat\beta^k) - L_n(\hat\beta^{k+1}) = \tfrac{n}{2}\varepsilon(2 - \varepsilon)\,\|\nabla L_n(\hat\beta^k)\|_\infty^2$$

$$\hat r^{k+1} - \hat r^k = -\varepsilon\,((\hat r^k)^TX_{j_k})\,X_{j_k} ,$$

which follow from (26) and Step (3.) of the LS-Boost($\varepsilon$) algorithm respectively. The above updates of the LS-Boost($\varepsilon$) algorithm clearly show that smaller values of the $\ell_\infty$ norm of the gradient slow down the "progress" of the residuals and thus the overall algorithm. Larger values of the norm of the gradient, on the other hand, lead to rapid "progress" in the algorithm. Here, we use the term "progress" to measure the amount of decrease in training error and the norm of the changes in successive residuals. Informally speaking, the LS-Boost($\varepsilon$) algorithm operationally works towards minimizing the unregularized least squares loss function, and the gradient of the least squares loss function is simultaneously shrunk towards zero. Equation (31) precisely quantifies the rate at which the $\ell_\infty$ norm of the gradient converges to zero. Observe that the bound is a minimum of two distinct rates: one which decays as $O(1/\sqrt{k})$ and another which is linear with parameter $\sqrt{\gamma}$. This is similar to item (v) of Theorem 2.1. For small values of $k$ the first rate will dominate, until a point is reached where the linear rate begins to dominate. Note that the dependence on the linear rate $\gamma$ suggests that for large values of correlations among the samples, the gradient decays slower than for smaller pairwise correlations among the samples.

The behavior of the LS-Boost($\varepsilon$) algorithm described above should be contrasted with the FS$_\varepsilon$ algorithm. In view of Step (3.) of the FS$_\varepsilon$ algorithm, the successive differences of the residuals in FS$_\varepsilon$ are indifferent to the magnitude of the gradient of the least squares loss function: as long as the gradient is non-zero, then for FS$_\varepsilon$ it holds that $\|\hat r^{k+1} - \hat r^k\|_2 = \varepsilon$. Thus FS$_\varepsilon$ undergoes a more erratic evolution, unlike LS-Boost($\varepsilon$) where the convergence of the residuals is much more "smooth."


A.2.4 Concentration Results for $\lambda_{\mathrm{pmin}}(X^TX)$ in the High-dimensional Case

Proposition A.3. Suppose that $p > n$, let $\tilde X \in \mathbb{R}^{n \times p}$ be a random matrix whose entries are i.i.d. standard normal random variables, and define $X := \tfrac{1}{\sqrt n}\tilde X$. Then, it holds that:

$$E[\lambda_{\mathrm{pmin}}(X^TX)] \ge \tfrac1n\left(\sqrt p - \sqrt n\right)^2 .$$

Furthermore, for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$, it holds that:

$$\lambda_{\mathrm{pmin}}(X^TX) \ge \tfrac1n\left(\sqrt p - \sqrt n - t\right)^2 .$$

Proof. Let $\sigma_1(\tilde X^T) \ge \sigma_2(\tilde X^T) \ge \ldots \ge \sigma_n(\tilde X^T)$ denote the ordered singular values of $\tilde X^T$ (equivalently of $\tilde X$). Then, Theorem 5.32 of [43] states that:

$$E[\sigma_n(\tilde X^T)] \ge \sqrt p - \sqrt n ,$$

which thus implies:

$$E[\lambda_{\mathrm{pmin}}(X^TX)] = E[(\sigma_n(X^T))^2] \ge (E[\sigma_n(X^T)])^2 = \tfrac1n(E[\sigma_n(\tilde X^T)])^2 \ge \tfrac1n\left(\sqrt p - \sqrt n\right)^2 ,$$

where the first inequality is Jensen's inequality.

Corollary 5.35 of [43] states that, for every $t \ge 0$, with probability at least $1 - 2\exp(-t^2/2)$ it holds that:

$$\sigma_n(\tilde X^T) \ge \sqrt p - \sqrt n - t ,$$

which implies that:

$$\lambda_{\mathrm{pmin}}(X^TX) = (\sigma_n(X^T))^2 = \tfrac1n(\sigma_n(\tilde X^T))^2 \ge \tfrac1n\left(\sqrt p - \sqrt n - t\right)^2 .$$

□
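To get a feel for the size of these bounds, the short sketch below simply evaluates the two expressions of Proposition A.3 for given $n$, $p$ and $t$. The helper name `pmin_eigenvalue_bounds` is hypothetical, and returning zero when $\sqrt p - \sqrt n - t < 0$ is a choice made here (the squared bound is only informative when that quantity is nonnegative), not something stated in the proposition.

```python
import math

def pmin_eigenvalue_bounds(n, p, t):
    """Evaluate the lower bounds of Proposition A.3 for lambda_pmin(X^T X),
    assuming p > n. Returns (expectation bound, high-probability bound,
    probability level 1 - 2*exp(-t^2/2))."""
    if p <= n:
        raise ValueError("Proposition A.3 assumes p > n")
    expectation_bound = (math.sqrt(p) - math.sqrt(n)) ** 2 / n
    slack = math.sqrt(p) - math.sqrt(n) - t
    # Assumption of this sketch: report the trivial bound 0 when slack < 0.
    high_prob_bound = slack ** 2 / n if slack >= 0 else 0.0
    probability = 1.0 - 2.0 * math.exp(-t * t / 2.0)
    return expectation_bound, high_prob_bound, probability

bounds = pmin_eigenvalue_bounds(n=100, p=400, t=3.0)
```

For $n = 100$, $p = 400$ and $t = 3$, the expectation bound evaluates to $(20 - 10)^2/100 = 1$ and the high-probability bound to $7^2/100 = 0.49$.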


Note that, in practice, we standardize the model matrix $X$ so that its columns have unit $\ell_2$ norm. Supposing that the entries of $X$ did originate from an i.i.d. standard normal matrix $\tilde X$, standardizing the columns of $\tilde X$ is not equivalent to setting $X := \tfrac{1}{\sqrt n}\tilde X$. But, for large enough $n$, standardizing is a valid approximation to normalizing by $\tfrac{1}{\sqrt n}$, i.e., $X \approx \tfrac{1}{\sqrt n}\tilde X$, and we may thus apply the above results.

A.3 Additional Details for Section 3

A.3.1 An Elementary Sequence Process Result, and a Proof of Proposition 3.1

Consider the following elementary sequence process: $x^0 \in \mathbb{R}^n$ is given, and $x^{i+1} \leftarrow x^i - \alpha_i g^i$ for all $i \ge 0$, where $g^i \in \mathbb{R}^n$ and $\alpha_i$ is a nonnegative scalar, for all $i$. For this process there are no assumptions on how the vectors $g^i$ might be generated.


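Proposition A.4. For the elementary sequence process described above, suppose that the $\{g^i\}$ are uniformly bounded, namely $\|g^i\|_2 \le G$ for all $i \ge 0$. Then for all $k \ge 0$ and for any $x \in \mathbb{R}^n$ it holds that:

$$\frac{1}{\sum_{i=0}^{k}\alpha_i}\sum_{i=0}^{k}\alpha_i\,(g^i)^T(x^i - x) \le \frac{\|x^0 - x\|_2^2 + G^2\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i} . \tag{35}$$

Indeed, in the case when $\alpha_i = \varepsilon$ for all $i$, it holds that:

$$\frac{1}{k + 1}\sum_{i=0}^{k}(g^i)^T(x^i - x) \le \frac{\|x^0 - x\|_2^2}{2(k + 1)\varepsilon} + \frac{G^2\varepsilon}{2} . \tag{36}$$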


Proof. Elementary arithmetic yields the following:

$$\begin{aligned}
\|x^{i+1} - x\|_2^2 &= \|x^i - \alpha_i g^i - x\|_2^2 \\
&= \|x^i - x\|_2^2 + \alpha_i^2\|g^i\|_2^2 + 2\alpha_i\,(g^i)^T(x - x^i) \\
&\le \|x^i - x\|_2^2 + G^2\alpha_i^2 + 2\alpha_i\,(g^i)^T(x - x^i) .
\end{aligned}$$

Rearranging and summing these inequalities for $i = 0, \ldots, k$ then yields:

$$2\sum_{i=0}^{k}\alpha_i\,(g^i)^T(x^i - x) \le G^2\sum_{i=0}^{k}\alpha_i^2 + \|x^0 - x\|_2^2 - \|x^{k+1} - x\|_2^2 \le G^2\sum_{i=0}^{k}\alpha_i^2 + \|x^0 - x\|_2^2 ,$$

which then rearranges to yield (35). (36) follows from (35) by direct substitution.

□
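Because Proposition A.4 places no assumptions on how the vectors $g^i$ are generated, inequality (35) can be checked with arbitrary data. The following minimal sketch draws random directions and step-sizes and compares the two sides of (35); the function name `sequence_process_sides`, the random seed, and the dimensions are illustrative assumptions.

```python
import random

def sequence_process_sides(x0, x_ref, grads, alphas):
    """Run x^{i+1} = x^i - alpha_i * g^i and return (lhs, rhs) of (35),
    taking G as the largest Euclidean norm among the supplied g^i."""
    x = list(x0)
    weighted_sum = 0.0
    for g, step in zip(grads, alphas):
        # accumulate alpha_i * (g^i)^T (x^i - x) before moving to x^{i+1}
        weighted_sum += step * sum(gj * (xj - rj) for gj, xj, rj in zip(g, x, x_ref))
        x = [xj - step * gj for xj, gj in zip(x, g)]
    g_sq = max(sum(gj * gj for gj in g) for g in grads)  # G^2
    dist_sq = sum((u - v) ** 2 for u, v in zip(x0, x_ref))
    total = sum(alphas)
    lhs = weighted_sum / total
    rhs = (dist_sq + g_sq * sum(s * s for s in alphas)) / (2.0 * total)
    return lhs, rhs

rng = random.Random(0)
grads = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(25)]
alphas = [rng.uniform(0.0, 0.5) for _ in range(25)]
lhs, rhs = sequence_process_sides([1.0, -2.0, 0.5, 3.0], [0.0, 1.0, 1.0, -1.0], grads, alphas)
```

The inequality `lhs <= rhs` holds for any choice of the reference point and any sequence of directions, which is what allows the same result to be reused for subgradient descent.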


Proof of Proposition 3.1: Consider the subgradient descent method (19) with arbitrary step-sizes $\alpha_i$ for all $i$. We will prove the following inequality:

$$\min_{i \in \{0, \ldots, k\}} f(x^i) \le f^* + \frac{\|x^0 - x^*\|_2^2 + G^2\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i} , \tag{37}$$

from which the proof of Proposition 3.1 follows by substituting $\alpha_i = \alpha$ for all $i$ and simplifying terms. Let us now prove (37). The subgradient descent method (19) is applied to instances of problem (17) where $f(\cdot)$ is convex, and where $g^i$ is a subgradient of $f(\cdot)$ at $x^i$, for all $i$. If $x^*$ is an optimal solution of (17), it therefore holds from the subgradient inequality that

$$f^* = f(x^*) \ge f(x^i) + (g^i)^T(x^* - x^i) .$$

Substituting this inequality in (35) for the value of $x = x^*$ yields:

$$\frac{\|x^0 - x^*\|_2^2 + G^2\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i} \ge \frac{1}{\sum_{i=0}^{k}\alpha_i}\sum_{i=0}^{k}\alpha_i\,(g^i)^T(x^i - x^*) \ge \frac{1}{\sum_{i=0}^{k}\alpha_i}\sum_{i=0}^{k}\alpha_i\,(f(x^i) - f^*) \ge \min_{i \in \{0, \ldots, k\}} f(x^i) - f^* .$$

□

A.3.2 Proof of Theorem 3.1


We first prove part (i). Note that item (i) of Proposition 3.2 shows that FS$_\varepsilon$ is a specific instance of subgradient descent to solve problem (21), using the constant step-size $\varepsilon$. Therefore we can apply the computational guarantees associated with the subgradient descent method, particularly Proposition 3.1, to the FS$_\varepsilon$ algorithm. Examining Proposition 3.1, we need to work out the corresponding values of $f^*$, $\|x^0 - x^*\|_2$, $\alpha$, and $G$ in the context of FS$_\varepsilon$ for solving the CM problem (21). Note that $f^* = 0$ for problem (21). We bound the distance from the initial residuals to the optimal least-squares residuals as follows:

$$\|\hat r^0 - r^*\|_2 = \|\hat r^0 - \hat r_{LS}\|_2 = \|y - (y - X\hat\beta_{LS})\|_2 = \|X\hat\beta_{LS}\|_2 .$$

From Proposition 3.2 part (i) we have $\alpha = \varepsilon$. Last of all, we need to determine an upper bound $G$ on the norms of subgradients. We have:

$$\|g^k\|_2 = \|\mathrm{sgn}((\hat r^k)^TX_{j_k})\,X_{j_k}\|_2 = \|X_{j_k}\|_2 = 1 ,$$

since the covariates have been standardized, so we can set $G = 1$. Now suppose algorithm FS$_\varepsilon$ is run for $k$ iterations. Proposition 3.1 then implies that:

$$\min_{i \in \{0, \ldots, k\}} \|X^T\hat r^i\|_\infty = \min_{i \in \{0, \ldots, k\}} f(\hat r^i) \le f^* + \frac{\|\hat r^0 - r^*\|_2^2}{2\alpha(k + 1)} + \frac{\alpha G^2}{2} = \frac{\|X\hat\beta_{LS}\|_2^2}{2\varepsilon(k + 1)} + \frac{\varepsilon}{2} . \tag{38}$$

The above inequality provides a bound on the best (among the first $k$ residual iterates) empirical correlation between the residuals $\hat r^i$ and each predictor variable, where the bound depends explicitly on the learning rate $\varepsilon$ and the number of iterations $k$. Furthermore, invoking the identity $\|\nabla L_n(\hat\beta^i)\|_\infty = \tfrac1n\|X^T\hat r^i\|_\infty$, the above inequality implies the following upper bound on the norm of the gradient of the least squares loss $L_n(\cdot)$ for the model iterates $\{\hat\beta^i\}$ generated by FS$_\varepsilon$:

$$\min_{i \in \{0, \ldots, k\}} \|\nabla L_n(\hat\beta^i)\|_\infty \le \frac{\|X\hat\beta_{LS}\|_2^2}{2n\varepsilon(k + 1)} + \frac{\varepsilon}{2n} . \tag{39}$$

Let $i$ be the index where the minimum is attained on the left side of the above inequality. In a similar vein as in the analysis in Section 2, we now use Proposition A.1, which presents two important properties of convex quadratic functions. Because $L_n(\cdot)$ is a convex quadratic function of the same format as Proposition A.1 with $h(\cdot) \leftarrow L_n(\cdot)$, $Q \leftarrow \tfrac1n X^TX$, and $h^* \leftarrow L_n^*$, it follows from the second property of Proposition A.1 that

$$\|\nabla L_n(\hat\beta^i)\|_2 \ge \sqrt{\frac{\lambda_{\mathrm{pmin}}(\tfrac1n X^TX)\,(L_n(\hat\beta^i) - L_n^*)}{2}} = \sqrt{\frac{\lambda_{\mathrm{pmin}}(X^TX)\,(L_n(\hat\beta^i) - L_n^*)}{2n}} ,$$

where recall that $\lambda_{\mathrm{pmin}}(X^TX)$ denotes the smallest non-zero (hence positive) eigenvalue of $X^TX$. Therefore

$$\|\nabla L_n(\hat\beta^i)\|_\infty^2 \ge \tfrac1p\,\|\nabla L_n(\hat\beta^i)\|_2^2 \ge \frac{\lambda_{\mathrm{pmin}}(X^TX)\,(L_n(\hat\beta^i) - L_n^*)}{2np} .$$

Substituting this inequality into (39) for the index $i$ where the minimum is attained yields after rearranging:

$$L_n(\hat\beta^i) - L_n^* \le \frac{p}{2n\,\lambda_{\mathrm{pmin}}(X^TX)}\left[\frac{\|X\hat\beta_{LS}\|_2^2}{\varepsilon(k + 1)} + \varepsilon\right]^2 , \tag{40}$$

which proves part (i). The proof of part (ii) follows by noting from the first inequality of Proposition A.1 that there exists a least-squares solution $\hat\beta^*$ for which:

$$\|\hat\beta^i - \hat\beta^*\|_2 \le \sqrt{\frac{2\,(L_n(\hat\beta^i) - L_n^*)}{\lambda_{\mathrm{pmin}}(\tfrac1n X^TX)}} = \sqrt{\frac{2n\,(L_n(\hat\beta^i) - L_n^*)}{\lambda_{\mathrm{pmin}}(X^TX)}} \le \frac{\sqrt p}{\lambda_{\mathrm{pmin}}(X^TX)}\left[\frac{\|X\hat\beta_{LS}\|_2^2}{\varepsilon(k + 1)} + \varepsilon\right] ,$$

where the second inequality in the above chain follows using (40). The proof of part (iii) follows by first observing that $\|X(\hat\beta^i - \hat\beta_{LS})\|_2 = \sqrt{2n\,(L_n(\hat\beta^i) - L_n^*)}$ and then substituting the bound on $(L_n(\hat\beta^i) - L_n^*)$ from part (i) and simplifying terms. Part (iv) is a restatement of inequality (38). Finally, parts (v) and (vi) are simple and well-known structural properties of FS$_\varepsilon$ that are re-stated here for completeness. □
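The bound (38) and the shrinkage property in part (v) are easy to observe on a small example. The sketch below is a minimal pure-Python version of the FS$_\varepsilon$ update as described in the proof (a step of size $\varepsilon$ along the sign of the largest correlation); the function name `forward_stagewise`, the toy data, and the parameter values are assumptions for illustration only.

```python
def forward_stagewise(X, y, eps, k):
    """Illustrative FS_eps with unit-norm columns. Returns beta^k and the
    values ||X^T r^i||_inf for i = 0, ..., k."""
    n, p = len(X), len(X[0])
    beta, r = [0.0] * p, list(y)
    max_corr = []
    for i in range(k + 1):
        u = [sum(r[m] * X[m][j] for m in range(n)) for j in range(p)]
        jk = max(range(p), key=lambda j: abs(u[j]))
        max_corr.append(abs(u[jk]))
        if i == k:
            break  # record the k-th correlation without taking another step
        sign = (u[jk] > 0) - (u[jk] < 0)
        beta[jk] += eps * sign
        r = [r[m] - eps * sign * X[m][jk] for m in range(n)]
    return beta, max_corr

# Toy design (assumption): unit-norm columns spanning the first two coordinates,
# so the least-squares fit of y = (1, 2, 0.5) is X beta_LS = (1, 2, 0).
X = [[1.0, 0.6], [0.0, 0.8], [0.0, 0.0]]
y = [1.0, 2.0, 0.5]
eps, k = 0.05, 200
beta, max_corr = forward_stagewise(X, y, eps, k)
xb_ls_sq = 1.0 ** 2 + 2.0 ** 2                       # ||X beta_LS||_2^2
bound_38 = xb_ls_sq / (2.0 * eps * (k + 1)) + eps / 2.0
```

Here `min(max_corr)` should not exceed `bound_38`, and the $\ell_1$ norm of `beta` should not exceed `k * eps`, since each iteration changes a single coefficient by at most $\varepsilon$ in absolute value.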


A.3.3 A deeper investigation of the computational guarantees for LS-Boost($\varepsilon$) and FS$_\varepsilon$

Here we show that in theory, LS-Boost($\varepsilon$) is much more efficient than FS$_\varepsilon$ if the primary goal is to obtain a model with a certain (pre-specified) data-fidelity. To formalize this notion, we consider a parameter $\tau \in (0, 1]$. We say that $\bar\beta$ is at a $\tau$-relative distance to the least squares predictions if $\bar\beta$ satisfies:

$$\|X\bar\beta - X\hat\beta_{LS}\|_2 \le \tau\,\|X\hat\beta_{LS}\|_2 . \tag{41}$$

Now let us pose the following question: if both LS-Boost($\varepsilon$) and FS$_\varepsilon$ are allowed to run with an appropriately chosen learning rate $\varepsilon$ for each algorithm, which algorithm will satisfy (41) in fewer iterations? We will answer this question by studying closely the computational guarantees of Theorems 2.1 and 3.1. Since our primary goal is to compute $\bar\beta$ satisfying (41), we may optimize the learning rate $\varepsilon$, for each algorithm, to achieve this goal with the smallest number of boosting iterations.


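Let us first study LS-Boost($\varepsilon$). As we have seen, a learning rate of $\varepsilon = 1$ achieves the fastest rate of linear convergence for LS-Boost($\varepsilon$) and is thus optimal with regard to the bound in part (iii) of Theorem 2.1. If we run LS-Boost($\varepsilon$) with $\varepsilon = 1$ for

$$k^{\text{LS-Boost}(\varepsilon)} := \left\lceil \frac{4p}{\lambda_{\mathrm{pmin}}(X^TX)}\,\ln\!\left(\frac{1}{\tau^2}\right) \right\rceil$$

iterations, then it follows from part (iii) of Theorem 2.1 that we achieve (41). Furthermore, it follows from (23) that the resulting $\ell_1$-shrinkage bound will satisfy:

$$\text{Sbound}^{\text{LS-Boost}(\varepsilon)} \le \|X\hat\beta_{LS}\|_2\,\sqrt{k^{\text{LS-Boost}(\varepsilon)}} .$$

For FS$_\varepsilon$, if one works out the arithmetic, the optimal number of boosting iterations to achieve (41) is given by

$$k^{\text{FS}_\varepsilon} := \left\lceil \frac{4p}{\lambda_{\mathrm{pmin}}(X^TX)}\,\frac{1}{\tau^2} \right\rceil - 1 \quad \text{using the learning rate} \quad \varepsilon = \frac{\|X\hat\beta_{LS}\|_2}{\sqrt{k^{\text{FS}_\varepsilon} + 1}} .$$

Also, it follows from part (v) of Theorem 3.1 that the resulting shrinkage bound will satisfy:

$$\text{Sbound}^{\text{FS}_\varepsilon} \le \varepsilon \cdot k^{\text{FS}_\varepsilon} \approx \|X\hat\beta_{LS}\|_2 \cdot \sqrt{k^{\text{FS}_\varepsilon}} .$$

Observe that $k^{\text{LS-Boost}(\varepsilon)} < k^{\text{FS}_\varepsilon}$, whereby LS-Boost($\varepsilon$) is able to achieve (41) in fewer iterations than FS$_\varepsilon$. Indeed, if we let $\eta$ denote the ratio $k^{\text{LS-Boost}(\varepsilon)}/k^{\text{FS}_\varepsilon}$, then it holds that

$$\eta := \frac{k^{\text{LS-Boost}(\varepsilon)}}{k^{\text{FS}_\varepsilon}} \approx \frac{\ln\!\left(\frac{1}{\tau^2}\right)}{\frac{1}{\tau^2}} \le \frac{1}{e} < 0.368 . \tag{42}$$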


The left panel of Figure 10 shows the value of $\eta$ as a function of $\tau$. For small values of the tolerance parameter $\tau$ we see that $\eta$ is itself close to zero, which means that LS-Boost($\varepsilon$) will need significantly fewer iterations than FS$_\varepsilon$ to achieve the condition (41).

We can also examine the $\ell_1$-shrinkage bounds similarly. If we let $\vartheta$ denote the ratio of $\text{Sbound}^{\text{LS-Boost}(\varepsilon)}$ to $\text{Sbound}^{\text{FS}_\varepsilon}$, then it holds that

$$\vartheta := \frac{\text{Sbound}^{\text{LS-Boost}(\varepsilon)}}{\text{Sbound}^{\text{FS}_\varepsilon}} = \frac{\sqrt{k^{\text{LS-Boost}(\varepsilon)}}}{\sqrt{k^{\text{FS}_\varepsilon}}} \approx \frac{\sqrt{\ln\!\left(\frac{1}{\tau^2}\right)}}{\frac{1}{\tau}} \le \frac{1}{\sqrt e} < 0.607 . \tag{43}$$

This means that if both bounds are relatively tight, then the $\ell_1$-shrinkage of the final model produced by LS-Boost($\varepsilon$) is smaller than that of the final model produced by FS$_\varepsilon$, by at least a factor of 0.607. The right panel of Figure 10 shows the value of $\vartheta$ as a function of $\tau$. For small values of the relative prediction error constant $\tau$ we see that $\vartheta$ is itself close to zero.

We summarize the above analysis in the following remark. 

Remark A.1. (Comparison of efficiency of LS-Boost($\varepsilon$) and FS$_\varepsilon$) Suppose that the primary goal is to achieve a $\tau$-relative prediction error as defined in (41), and that LS-Boost($\varepsilon$) and FS$_\varepsilon$ are run with appropriately determined learning rates for each algorithm. Then the ratio of required number of iterations of these methods to achieve (41) satisfies

$$\eta := \frac{k^{\text{LS-Boost}(\varepsilon)}}{k^{\text{FS}_\varepsilon}} < 0.368 .$$

Also, the ratio of the shrinkage bounds from running these numbers of iterations satisfies

$$\vartheta := \frac{\text{Sbound}^{\text{LS-Boost}(\varepsilon)}}{\text{Sbound}^{\text{FS}_\varepsilon}} < 0.607 ,$$

where all of the analysis is according to the bounds in Theorems 3.1 and 2.1.

Figure 10: Plot showing the value of the ratio $\eta$ of iterations of LS-Boost($\varepsilon$) to FS$_\varepsilon$ (equation (42)) versus the target relative prediction error $\tau$ [left panel], and the ratio $\vartheta$ of shrinkage bounds of LS-Boost($\varepsilon$) to FS$_\varepsilon$ (equation (43)) versus the target relative prediction error $\tau$ [right panel].


We caution the reader that the analysis leading to Remark A.1 is premised on the singular goal of achieving (41) in as few iterations as possible. As mentioned previously, the models produced in the interior of the boosting profile are more statistically interesting than those produced at the end. Thus for both algorithms it may be beneficial, and may lessen the risk of overfitting, to trace out a smoother profile by selecting the learning rate $\varepsilon$ to be smaller than the prescribed values in this subsection ($\varepsilon = 1$ for LS-Boost($\varepsilon$) and $\varepsilon = \|X\hat\beta_{LS}\|_2 / \sqrt{k^{\text{FS}_\varepsilon} + 1}$ for FS$_\varepsilon$). Indeed, considering just LS-Boost($\varepsilon$) for simplicity, if our goal is to produce a $\tau$-relative prediction error according to (41) with the smallest possible $\ell_1$ shrinkage, then Figure 3 suggests that this should be accomplished by selecting $\varepsilon$ as small as possible (essentially very slightly larger than 0).

A.4 Additional Details for Section 4 

A.4.1 Duality Between Regularized Correlation Minimization and the Lasso 


In this section, we precisely state the duality relationship between the RCM problem (24) and the Lasso. We first prove the following property of the least squares loss function that will be useful in our analysis.


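Proposition A.5. The least squares loss function $L_n(\cdot)$ has the following max representation:

$$L_n(\beta) = \max_{\tilde r \in P_{\mathrm{res}}} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} , \tag{44}$$

where $P_{\mathrm{res}} := \{r \in \mathbb{R}^n : r = y - X\beta \text{ for some } \beta \in \mathbb{R}^p\}$. Moreover, the unique optimal solution (as a function of $\beta$) to the subproblem in (44) is $\bar r := y - X\beta$.

Proof. For any $\beta \in \mathbb{R}^p$, it is easy to verify through optimality conditions (setting the gradient with respect to $\tilde r$ equal to 0) that $\bar r$ solves the subproblem in (44), i.e.,

$$\bar r = \arg\max_{\tilde r \in P_{\mathrm{res}}} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} .$$

Thus, we have

$$\max_{\tilde r \in P_{\mathrm{res}}} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} = \tfrac1n\left(\tfrac12\|y\|_2^2 - y^TX\beta + \tfrac12\|X\beta\|_2^2\right) = \tfrac{1}{2n}\|y - X\beta\|_2^2 .$$

□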
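The max representation (44) can be verified numerically on a small instance: the objective evaluated at $\bar r = y - X\beta$ should equal $L_n(\beta)$, and no other residual vector should give a larger value. The sketch below does this in pure Python; the helper names `ls_loss`, `residuals` and `max_rep_objective`, and the toy data, are illustrative assumptions.

```python
import random

def residuals(X, y, beta):
    """Return y - X beta for X given as a list of rows."""
    return [yi - sum(xij * bj for xij, bj in zip(row, beta)) for row, yi in zip(X, y)]

def ls_loss(X, y, beta):
    """Least squares loss L_n(beta) = ||y - X beta||_2^2 / (2n)."""
    return sum(v * v for v in residuals(X, y, beta)) / (2.0 * len(y))

def max_rep_objective(r_tilde, X, y, beta):
    """Objective inside the max in (44), evaluated at r_tilde."""
    n = len(y)
    xb = [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]
    inner = sum(rt * v for rt, v in zip(r_tilde, xb))
    dist_sq = sum((rt - yi) ** 2 for rt, yi in zip(r_tilde, y))
    return (-inner - 0.5 * dist_sq + 0.5 * sum(yi * yi for yi in y)) / n

X = [[1.0, 0.6], [0.0, 0.8], [0.0, 0.0]]
y = [1.0, 2.0, 0.5]
beta = [0.3, -1.2]
value_at_r_bar = max_rep_objective(residuals(X, y, beta), X, y, beta)

# Evaluate the objective at residuals of randomly drawn coefficient vectors,
# i.e. at other points of P_res.
rng = random.Random(1)
other_values = [
    max_rep_objective(residuals(X, y, [rng.uniform(-3, 3), rng.uniform(-3, 3)]), X, y, beta)
    for _ in range(100)
]
```

Since the objective in (44) is a concave quadratic in $\tilde r$ whose unconstrained maximizer $y - X\beta$ already lies in $P_{\mathrm{res}}$, every entry of `other_values` is at most `value_at_r_bar`, which in turn matches `ls_loss(X, y, beta)`.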


The following result demonstrates that RCM$_\delta$ (24) has a direct interpretation as a (scaled) dual of the Lasso problem (2). Moreover, in part (iii) of the below Proposition, we give a bound on the optimality gap for the Lasso problem in terms of a quantity that is closely related to the objective function of RCM$_\delta$.

Proposition A.6. (Duality Equivalence of Lasso and RCM$_\delta$, and Optimality Bounds) The Lasso problem (2) and the regularized correlation minimization problem RCM$_\delta$ (24) are dual optimization problems modulo the scaling factor $\tfrac{\delta}{n}$. In particular:

(i) (Weak Duality) If $\beta$ is feasible for the Lasso problem (2), and if $\tilde r$ is feasible for the regularized correlation minimization problem RCM$_\delta$ (24), then

$$L_n(\beta) + \tfrac{\delta}{n} f_\delta(\tilde r) \ge \tfrac{1}{2n}\|y\|_2^2 .$$

(ii) (Strong Duality) It holds that:

$$L_{n,\delta}^* + \tfrac{\delta}{n} f_\delta^* = \tfrac{1}{2n}\|y\|_2^2 .$$

Moreover, for any given parameter value $\delta > 0$, there is a unique vector of residuals $\hat r_\delta^*$ associated with every Lasso solution $\hat\beta_\delta^*$, i.e., $\hat r_\delta^* = y - X\hat\beta_\delta^*$, and $\hat r_\delta^*$ is the unique optimal solution to the RCM$_\delta$ problem (24).

(iii) (Optimality Condition for Lasso) If $\beta$ is feasible for the Lasso problem (2) and $r = y - X\beta$, then

$$\omega_\delta(\beta) := \|X^Tr\|_\infty - \frac{r^TX\beta}{\delta} \ge 0 , \tag{45}$$

and

$$L_n(\beta) - L_{n,\delta}^* \le \tfrac{\delta}{n} \cdot \omega_\delta(\beta) .$$

Hence, if $\omega_\delta(\beta) = 0$, then $\beta$ is an optimal solution of the Lasso problem (2).


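Proof. Let us first construct the problem RCM$_\delta$ using basic constructs of minmax duality. As demonstrated in Proposition A.5, the least-squares loss function $L_n(\cdot)$ has the following max representation:

$$L_n(\beta) = \max_{\tilde r \in P_{\mathrm{res}}} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} .$$

Therefore the Lasso problem (2) can be written as

$$\min_{\beta \in B_\delta}\;\max_{\tilde r \in P_{\mathrm{res}}} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} ,$$

where $B_\delta := \{\beta \in \mathbb{R}^p : \|\beta\|_1 \le \delta\}$. We construct a dual of the above problem by interchanging the min and max operators above, yielding the following dual optimization problem:

$$\max_{\tilde r \in P_{\mathrm{res}}}\;\min_{\beta \in B_\delta} \left\{ -\tilde r^T(\tfrac1n X)\beta - \tfrac{1}{2n}\|\tilde r - y\|_2^2 + \tfrac{1}{2n}\|y\|_2^2 \right\} .$$

After negating, and dropping the constant term $\tfrac{1}{2n}\|y\|_2^2$, the above dual problem is equivalent to:

$$\min_{\tilde r \in P_{\mathrm{res}}}\;\max_{\beta \in B_\delta} \left\{ \tilde r^T(\tfrac1n X)\beta \right\} + \tfrac{1}{2n}\|\tilde r - y\|_2^2 . \tag{46}$$

Now notice that

$$\max_{\beta \in B_\delta} \left\{ \tilde r^T(\tfrac1n X)\beta \right\} = \tfrac{\delta}{n}\left(\max_{j \in \{1, \ldots, p\}} |\tilde r^TX_j|\right) = \tfrac{\delta}{n}\|X^T\tilde r\|_\infty , \tag{47}$$

from which it follows after scaling by $\tfrac{n}{\delta}$ that (46) is equivalent to (24).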

Let us now prove item (i). Let $\beta$ be feasible for the Lasso problem (2) and $\tilde r$ be feasible for the regularized correlation minimization problem RCM$_\delta$ (24), and let $r = y - X\beta$ and let $\tilde\beta$ be such that $\tilde r = y - X\tilde\beta$. Then direct arithmetic manipulation yields the following equality:

$$L_n(\beta) + \tfrac{\delta}{n} f_\delta(\tilde r) = \tfrac{1}{2n}\|y\|_2^2 + \tfrac{1}{2n}\|r - \tilde r\|_2^2 + \tfrac{\delta}{n}\left(\|X^T\tilde r\|_\infty - \frac{\tilde r^TX\beta}{\delta}\right) , \tag{48}$$

from which the result follows since $\|r - \tilde r\|_2^2 \ge 0$ and $\tilde r^TX\beta \le \|X^T\tilde r\|_\infty\|\beta\|_1 \le \delta\,\|X^T\tilde r\|_\infty$, which implies that the last term above is also nonnegative.

To prove item (ii), notice that both the Lasso and RCM$_\delta$ can be re-cast as optimization problems with a convex quadratic objective function and with linear inequality constraints. That being the case, the classical strong duality results for linearly-constrained convex quadratic optimization apply, see [1] for example.

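We now prove (iii). Since $\beta$ is feasible for the Lasso problem, it follows from the Holder inequality that $r^TX\beta \le \|X^Tr\|_\infty\|\beta\|_1 \le \delta\,\|X^Tr\|_\infty$, from which it then follows that $\omega_\delta(\beta) \ge 0$. Invoking (48) with $\tilde r \leftarrow r = y - X\beta$ yields:

$$L_n(\beta) + \tfrac{\delta}{n} f_\delta(r) = \tfrac{1}{2n}\|y\|_2^2 + \tfrac{\delta}{n} \cdot \omega_\delta(\beta) .$$

Combining the above with strong duality (ii) yields:

$$L_n(\beta) + \tfrac{\delta}{n} f_\delta(r) = L_{n,\delta}^* + \tfrac{\delta}{n} f_\delta^* + \tfrac{\delta}{n} \cdot \omega_\delta(\beta) .$$

After rearranging we have:

$$L_n(\beta) - L_{n,\delta}^* = \tfrac{\delta}{n} f_\delta^* - \tfrac{\delta}{n} f_\delta(r) + \tfrac{\delta}{n} \cdot \omega_\delta(\beta) \le \tfrac{\delta}{n} \cdot \omega_\delta(\beta) ,$$

where the last inequality follows since $f_\delta^* \le f_\delta(r)$. □

A.4.2 Proof of Proposition 4.1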


Recall the update formula for the residuals in R-FS e ^: 


f k+l 


sgn((r 


k\T 




( 49 ) 


We first show that g k : = sgn ((f fc ) i X ?fc )Xj fc + |(f fe — y) is a subgradient of fs(-) at f . Recalling 
the proof of Proposition 


3.2 


we have that sgn((r fc ) i ~Kj k )X.j k is a subgradient of /(r) := ||X r rj 


at r k since jk argmax J - e r 1> |(f fc ) T Xj|. Therefore, since f$(r ) = f(r) + ^||r — y|||, it follows 
from the additive property of subgradients (and gradients) that g k = sgn((f fc ) T Xj fc )X Jfc + r k — y) 
is a subgradient of f$(r ) at r = r k . Therefore the update (49) is of the form f k+1 = r k — eg k where 


g k G dfs(f k ). Finally note that f k —eg k = f k+1 = y—X/3 fc+ ME P res , hence II p ies {r k —£g k ) = r k —eg k , 
i.e., the projection step is superfluous here. Therefore f k+1 = Flp res (f fc — eg k ), which shows that 


(49) is precisely the update for the subgradient descent method with step-size a*, := s. □ 
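The argument above can be illustrated with a short sketch of the R-FS$_{\varepsilon,\delta}$ iteration. This is an illustrative pure-Python rendering under stated assumptions, not the authors' implementation: the function name `rfs` is hypothetical, $X$ is a list of rows with unit $\ell_2$-norm columns, $\varepsilon \le \delta$, and $\mathrm{sgn}(0) = 0$ is assumed. The residual is updated by the subgradient step (49) and the coefficients by the matching shrink-and-step rule, so that $\hat r^k = y - X\hat\beta^k$ should hold at every iteration without any projection.

```python
def sgn(t):
    """Sign function with sgn(0) = 0 (an assumption of this sketch)."""
    return (t > 0) - (t < 0)


def l1_norm(v):
    """l1 norm of a vector."""
    return sum(abs(t) for t in v)


def rfs(X, y, eps, delta, num_iter):
    """Run num_iter iterations of the R-FS updates.

    X is a list of rows; columns are assumed to have unit l2 norm,
    and eps <= delta is assumed, as in the analysis.
    Returns the final coefficients and residuals.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # r^0 = y because beta^0 = 0
    for _ in range(num_iter):
        # Correlations X_j^T r and the most correlated feature j_k.
        corr = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        jk = max(range(p), key=lambda j: abs(corr[j]))
        s = sgn(corr[jk])
        # Subgradient step (49): g^k = s * X_jk + (r - y) / delta.
        r = [r[i] - eps * (s * X[i][jk] + (r[i] - y[i]) / delta)
             for i in range(n)]
        # Matching coefficient update: shrink every entry, then move j_k.
        beta = [(1.0 - eps / delta) * b for b in beta]
        beta[jk] += eps * s
    return beta, r
```

Comparing the returned `r` with `y - X beta` recomputed from the returned `beta` gives a direct check that the iterates remain in $P_{\mathrm{res}}$.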


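A.4.3 Proof of Theorem 4.1

Let us first use induction to demonstrate that the following inequality holds:

$$\|\hat\beta^k\|_1 \le \varepsilon \sum_{j=0}^{k-1} \left(1 - \frac{\varepsilon}{\delta}\right)^j \quad \text{for all } k \ge 0 . \tag{50}$$

Clearly, (50) holds for $k = 0$ since $\hat\beta^0 = 0$. Assuming that (50) holds for $k$, then the update for $\hat\beta^{k+1}$ in step (3.) of algorithm R-FS$_{\varepsilon,\delta}$ can be written as $\hat\beta^{k+1} = (1 - \frac{\varepsilon}{\delta})\hat\beta^k + \varepsilon \cdot \mathrm{sgn}((\hat r^k)^T X_{j_k}) e_{j_k}$, from which it holds that

$$\begin{aligned}
\|\hat\beta^{k+1}\|_1 &= \left\| \left(1 - \tfrac{\varepsilon}{\delta}\right)\hat\beta^k + \varepsilon \cdot \mathrm{sgn}((\hat r^k)^T X_{j_k}) e_{j_k} \right\|_1 \\
&\le \left(1 - \tfrac{\varepsilon}{\delta}\right)\|\hat\beta^k\|_1 + \varepsilon \|e_{j_k}\|_1 \\
&\le \left(1 - \tfrac{\varepsilon}{\delta}\right) \varepsilon \sum_{j=0}^{k-1} \left(1 - \tfrac{\varepsilon}{\delta}\right)^j + \varepsilon \\
&= \varepsilon \sum_{j=0}^{k} \left(1 - \tfrac{\varepsilon}{\delta}\right)^j ,
\end{aligned}$$

which completes the induction. Now note that (50) is a geometric series and we have:

$$\|\hat\beta^k\|_1 \le \varepsilon \sum_{j=0}^{k-1} \left(1 - \frac{\varepsilon}{\delta}\right)^j = \delta\left[ 1 - \left(1 - \frac{\varepsilon}{\delta}\right)^k \right] \le \delta \quad \text{for all } k \ge 0 . \tag{51}$$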
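The geometric-series identity behind (51) is easy to confirm numerically. The snippet below is a small sketch with hypothetical function names; it compares the explicit sum in (50) with the closed form in (51), assuming $0 < \varepsilon \le \delta$.

```python
def l1_bound_sum(eps, delta, k):
    """Right side of (50): eps * sum_{j=0}^{k-1} (1 - eps/delta)^j."""
    return eps * sum((1.0 - eps / delta) ** j for j in range(k))


def l1_bound_closed_form(eps, delta, k):
    """Closed form in (51): delta * (1 - (1 - eps/delta)^k)."""
    return delta * (1.0 - (1.0 - eps / delta) ** k)
```

Both expressions increase with $k$ and approach $\delta$ from below, which is the shrinkage bound on $\|\hat\beta^k\|_1$ used later in the proof.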


Recall that we developed the algorithm R-FS$_{\varepsilon,\delta}$ in such a way that it corresponds exactly to an instantiation of the subgradient descent method applied to the RCM problem (24). Indeed, the update rule for the residuals given in Step (3.) of R-FS$_{\varepsilon,\delta}$ is: $\hat r^{k+1} \leftarrow \hat r^k - \varepsilon g^k$ where $g^k = \left[ \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k} + \frac{1}{\delta}(\hat r^k - y) \right]$. We therefore can apply Proposition A.4, and more specifically the inequality (36). In order to do so we need to translate the terms of Proposition A.4 to our setting: here the variables $x$ are now the residuals $r$, the iterates $x^i$ are now the iterates $\hat r^i$, etc. The step-sizes of algorithm R-FS$_{\varepsilon,\delta}$ are fixed at $\varepsilon$, so we have $\alpha_i = \varepsilon$ for all $i \ge 0$. Setting the value of $x$ in Proposition A.4 to be the least-squares residual value, namely $x = \hat r_{LS}$, the left side of (36) is therefore:

$$\begin{aligned}
\frac{1}{k+1} \sum_{i=0}^{k} (g^i)^T (x^i - x)
&= \frac{1}{k+1} \sum_{i=0}^{k} \left( \mathrm{sgn}((\hat r^i)^T X_{j_i}) X_{j_i} - \tfrac{1}{\delta}(X\hat\beta^i) \right)^T (\hat r^i - \hat r_{LS}) \\
&= \frac{1}{k+1} \sum_{i=0}^{k} \left( \mathrm{sgn}((\hat r^i)^T X_{j_i}) X_{j_i} - \tfrac{1}{\delta}(X\hat\beta^i) \right)^T \hat r^i \\
&= \frac{1}{k+1} \sum_{i=0}^{k} \left[ \|X^T \hat r^i\|_\infty - \tfrac{1}{\delta} (\hat r^i)^T X\hat\beta^i \right] \\
&= \frac{1}{k+1} \sum_{i=0}^{k} \omega_\delta(\hat\beta^i) ,
\end{aligned} \tag{52}$$

where the second equality uses the fact that $X^T \hat r_{LS} = 0$ from (6) and the fourth equality uses the definition of $\omega_\delta(\beta)$ from (45).


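Let us now evaluate the right side of (36). We have $\|x^0 - x\|_2 = \|\hat r^0 - \hat r_{LS}\|_2 = \|y - (y - X\hat\beta_{LS})\|_2 = \|X\hat\beta_{LS}\|_2$. Also, it holds that

$$\|g^i\|_2 = \left\| \mathrm{sgn}((\hat r^i)^T X_{j_i}) X_{j_i} - \tfrac{1}{\delta}(X\hat\beta^i) \right\|_2 \le \|X_{j_i}\|_2 + \left\| X\left(\tfrac{\hat\beta^i}{\delta}\right) \right\|_2 \le 1 + \tfrac{1}{\delta}\|X\|_{1,2}\|\hat\beta^i\|_1 \le 1 + \|X\|_{1,2} \le 2 ,$$

where the third inequality follows since $\|\hat\beta^i\|_1 \le \delta$ from (51) and the second and fourth inequalities follow from the assumption that the columns of $X$ have been normalized to have unit $\ell_2$ norm. Therefore $G = 2$ is a uniform bound on $\|g^i\|_2$. Combining the above, inequality (36) implies that after running R-FS$_{\varepsilon,\delta}$ for $k$ iterations, it holds that:

$$\min_{i \in \{0,\ldots,k\}} \omega_\delta(\hat\beta^i) \le \frac{1}{k+1} \sum_{i=0}^{k} \omega_\delta(\hat\beta^i) \le \frac{\|X\hat\beta_{LS}\|_2^2}{2(k+1)\varepsilon} + \frac{2^2 \varepsilon}{2} = \frac{\|X\hat\beta_{LS}\|_2^2}{2\varepsilon(k+1)} + 2\varepsilon , \tag{53}$$

where the first inequality is elementary arithmetic and the second inequality is the application of (36). Now let $i$ be the index obtaining the minimum in the left-most side of the above. Then it follows from part (iii) of Proposition A.6 that

$$L_n(\hat\beta^i) - L^*_{n,\delta} \le \frac{\delta}{n} \cdot \omega_\delta(\hat\beta^i) \le \frac{\delta \|X\hat\beta_{LS}\|_2^2}{2n\varepsilon(k+1)} + \frac{2\delta\varepsilon}{n} , \tag{54}$$

which proves item (i) of the theorem.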
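To get a feel for the two terms in (54), the bound can be evaluated directly. The sketch below uses hypothetical function names, and `xb_ls_sq` stands for $\|X\hat\beta_{LS}\|_2^2$. The second helper is a side calculation that is not taken from the text above: for a fixed $k$, setting the derivative of the right side of (54) with respect to $\varepsilon$ to zero gives $\varepsilon = \|X\hat\beta_{LS}\|_2 / (2\sqrt{k+1})$, which is only meaningful when this value does not exceed $\delta$.

```python
import math


def training_error_bound(delta, n, eps, k, xb_ls_sq):
    """Right side of (54): (delta/n) * (||X b_LS||^2 / (2 eps (k+1)) + 2 eps)."""
    return (delta / n) * (xb_ls_sq / (2.0 * eps * (k + 1)) + 2.0 * eps)


def eps_minimizing_bound(k, xb_ls_sq):
    """Value of eps minimizing the bound for fixed k (side calculation)."""
    return math.sqrt(xb_ls_sq) / (2.0 * math.sqrt(k + 1))
```

For a fixed $\varepsilon$ the first term decays like $1/(k+1)$ while the second term $2\delta\varepsilon/n$ does not depend on $k$, so the bound levels off at $2\delta\varepsilon/n$ as the number of iterations grows.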

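To prove item (ii), note first that if $\hat\beta^*_\delta$ is a solution of the Lasso problem (2), then it holds that $\|\hat\beta^*_\delta\|_1 \le \delta$ (feasibility) and $\omega_\delta(\hat\beta^*_\delta) = 0$ (optimality). This latter condition follows easily from the optimality conditions of linearly constrained convex quadratic problems, see [1] for example. Setting $\hat r^*_\delta = y - X\hat\beta^*_\delta$, the following holds true:

$$\begin{aligned}
\|X\hat\beta^i - X\hat\beta^*_\delta\|_2^2
&= 2n\left( L_n(\hat\beta^i) - L_n(\hat\beta^*_\delta) \right) + 2 (\hat r^*_\delta)^T X (\hat\beta^i - \hat\beta^*_\delta) \\
&= 2n\left( L_n(\hat\beta^i) - L^*_{n,\delta} \right) - 2\delta \|X^T \hat r^*_\delta\|_\infty + 2 (\hat r^*_\delta)^T X\hat\beta^i \\
&\le 2n\left( L_n(\hat\beta^i) - L^*_{n,\delta} \right) - 2\delta \|X^T \hat r^*_\delta\|_\infty + 2 \|X^T \hat r^*_\delta\|_\infty \|\hat\beta^i\|_1 \\
&\le 2n\left( L_n(\hat\beta^i) - L^*_{n,\delta} \right) - 2\delta \|X^T \hat r^*_\delta\|_\infty + 2\delta \|X^T \hat r^*_\delta\|_\infty \\
&= 2n\left( L_n(\hat\beta^i) - L^*_{n,\delta} \right) \\
&\le \frac{\delta \|X\hat\beta_{LS}\|_2^2}{\varepsilon(k+1)} + 4\delta\varepsilon ,
\end{aligned}$$

where the first equality is from direct arithmetic substitution, the second equality uses the fact that $\omega_\delta(\hat\beta^*_\delta) = 0$ whereby $(\hat r^*_\delta)^T X\hat\beta^*_\delta = \delta \|X^T \hat r^*_\delta\|_\infty$, the first inequality follows by applying Hölder's inequality to the last term of the second equality, the second inequality uses $\|\hat\beta^i\|_1 \le \delta$ from (51), and the final inequality is an application of (54). Item (ii) then follows by taking square roots of the above.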


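Item (iii) is essentially just (51). Indeed, since $i \le k$ we have:

$$\|\hat\beta^i\|_1 \le \varepsilon \sum_{j=0}^{i-1} \left(1 - \frac{\varepsilon}{\delta}\right)^j \le \varepsilon \sum_{j=0}^{k-1} \left(1 - \frac{\varepsilon}{\delta}\right)^j = \delta\left[ 1 - \left(1 - \frac{\varepsilon}{\delta}\right)^k \right] \le \delta .$$

(Note that we emphasize the dependence on $k$ rather than $i$ in the above since we have direct control over the number of boosting iterations $k$.) Item (iv) of the theorem is just a restatement of the sparsity property of R-FS$_{\varepsilon,\delta}$. □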

A.4.4 Regularized Boosting: Related Work and Context 

As we have already seen, the FS$_\varepsilon$ algorithm leads to models that have curious similarities with the Lasso coefficient profile, but in general the profiles are different. Sufficient conditions under which the coefficient profiles of FS$_\varepsilon$ (for $\varepsilon \approx 0$) and Lasso are equivalent have been explored in [27]. A related research question is whether there are structurally similar algorithmic variants of FS$_\varepsilon$ that lead to Lasso solutions for arbitrary datasets. In this vein [45] propose BLasso, a corrective version of the forward stagewise algorithm. BLasso, in addition to taking incremental forward steps (as in FS$_\varepsilon$), also takes backward steps, the result of which is that the algorithm approximates the Lasso coefficient profile under certain assumptions on the data. The authors observe that BLasso often leads to models that are sparser and have better predictive accuracy than those produced by FS$_\varepsilon$.


In [10], the authors point out that models delivered by boosting methods need not be adequately sparse, and they highlight the importance of obtaining models that have more sparsity, better prediction accuracy, and better variable selection properties. They propose a sparse variant of L2-Boost (see also Section 1) which considers a regularized version of the squared error loss, penalizing the approximate degrees of freedom of the model.

In [26], the authors also point out that boosting algorithms often lead to a large collection of nonzero coefficients. They suggest reducing the complexity of the model by some form of “post-processing” technique; one such proposal is to apply a Lasso regularization on the selected set of coefficients.


A parallel line of work in machine learning [14] explores the scope of boosting-like algorithms on $\ell_1$-regularized versions of different loss functions arising mainly in the context of classification problems. The proposal of [14], when adapted to the least squares regression problem with $\ell_1$-regularization penalty, leads to the following optimization problem:

$$\min_{\beta} \;\; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 , \tag{55}$$

for which the authors [14] employ greedy coordinate descent methods. Like the boosting algorithms considered herein, at each iteration the algorithm studied by [14] selects a certain coefficient $\beta_{j_k}$ to update, leaving all other coefficients $\beta_j$ unchanged. The amount with which to update the coefficient $\beta_{j_k}$ is determined by fully optimizing the loss function (55) with respect to $\beta_{j_k}$, again holding all other coefficients constant (note that one recovers LS-Boost(1) if $\lambda = 0$). This way of updating $\beta_{j_k}$ leads to a simple soft-thresholding operation [13] and is structurally different from forward stagewise update rules. In contrast, the boosting algorithm R-FS$_{\varepsilon,\delta}$ that we propose here is based on subgradient descent on the dual of the Lasso problem (2), i.e., problem (24).
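For contrast with the forward stagewise updates, the single-coordinate minimization of (55) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [14]: it assumes the $\frac{1}{2n}$ scaling written in (55), shows only the exact update of one chosen coordinate `j` (the rule for selecting the coordinate is not specified here), and uses hypothetical function names.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t >= 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def objective(X, y, beta, lam):
    """Objective (55): (1/(2n)) ||y - X beta||_2^2 + lam * ||beta||_1."""
    n = len(y)
    res = [yi - sum(a * b for a, b in zip(row, beta)) for row, yi in zip(X, y)]
    return sum(t * t for t in res) / (2.0 * n) + lam * sum(abs(b) for b in beta)


def coordinate_update(X, y, beta, j, lam):
    """Exactly minimize (55) over coordinate j, holding the others fixed."""
    n = len(y)
    # Partial residual: remove the fit of every coordinate except j.
    partial = [yi - sum(row[l] * beta[l] for l in range(len(beta)) if l != j)
               for row, yi in zip(X, y)]
    col_sq = sum(row[j] ** 2 for row in X)           # ||X_j||_2^2
    z = sum(row[j] * pj for row, pj in zip(X, partial))  # X_j^T partial
    new_beta = list(beta)
    new_beta[j] = soft_threshold(z, n * lam) / col_sq
    return new_beta
```

With `lam = 0` and a unit-norm column, the update reduces to adding $X_j^T r$ to $\beta_j$, where $r = y - X\beta$ is the current residual, which is the full least squares step on that coordinate.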


A.4.5 Connecting R-FS$_{\varepsilon,\delta}$ to the Frank-Wolfe method

Although we developed and analyzed R-FS$_{\varepsilon,\delta}$ from the perspective of subgradient descent, one can also interpret R-FS$_{\varepsilon,\delta}$ as the Frank-Wolfe algorithm in convex optimization [16, 17, 30] applied to the Lasso (2). This secondary interpretation can be derived directly from the structure of the updates in R-FS$_{\varepsilon,\delta}$ or as a special case of a more general primal-dual equivalence between subgradient descent and Frank-Wolfe developed in [2]. We choose here to focus on the subgradient descent interpretation since it provides a natural unifying framework for a general class of boosting algorithms (including FS$_\varepsilon$ and R-FS$_{\varepsilon,\delta}$) via a single algorithm applied to a parametric class of objective functions. Other authors have commented on the similarities between boosting algorithms and the Frank-Wolfe method, see for instance [11] and [30].
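The correspondence can be seen on a single iteration. The sketch below, with hypothetical function names, compares the R-FS$_{\varepsilon,\delta}$ coefficient update with a Frank-Wolfe step on the Lasso over the ball $\{\beta : \|\beta\|_1 \le \delta\}$. It assumes the Frank-Wolfe step-size is $\alpha = \varepsilon/\delta$ and that the linear minimization oracle returns the vertex $\delta \cdot \mathrm{sgn}(X_{j}^T r) e_{j}$ for the most correlated feature $j$; ties and $\mathrm{sgn}(0)$ are handled in the same arbitrary way in both functions.

```python
def sgn(t):
    """Sign function with sgn(0) = 0 (an assumption of this sketch)."""
    return (t > 0) - (t < 0)


def top_feature(X, y, beta):
    """Return (j, s): the index maximizing |X_j^T r| and the sign of X_j^T r."""
    n, p = len(X), len(X[0])
    r = [y[i] - sum(X[i][l] * beta[l] for l in range(p)) for i in range(n)]
    corr = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
    j = max(range(p), key=lambda idx: abs(corr[idx]))
    return j, sgn(corr[j])


def rfs_beta_step(X, y, beta, eps, delta):
    """One R-FS coefficient update: shrink by (1 - eps/delta), step eps on j."""
    j, s = top_feature(X, y, beta)
    new_beta = [(1.0 - eps / delta) * b for b in beta]
    new_beta[j] += eps * s
    return new_beta


def frank_wolfe_step(X, y, beta, delta, alpha):
    """One Frank-Wolfe step on the Lasso over the l1 ball of radius delta.

    The gradient of L_n at beta is -(1/n) X^T r, so the linear minimization
    over the ball is attained at the vertex delta * sgn(X_j^T r) * e_j.
    """
    j, s = top_feature(X, y, beta)
    vertex = [0.0] * len(beta)
    vertex[j] = delta * s
    return [(1.0 - alpha) * b + alpha * v for b, v in zip(beta, vertex)]
```

Calling `frank_wolfe_step` with `alpha = eps / delta` should reproduce `rfs_beta_step` up to floating-point rounding.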

A.5 Additional Details for Section 5 
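A.5.1 Proof of Theorem 5.1

We first prove the feasibility of $\hat\beta^k$ for the Lasso problem with parameter $\delta_k$. We do so by induction. The feasibility of $\hat\beta^k$ is obviously true for $k = 0$ since $\hat\beta^0 = 0$ and hence $\|\hat\beta^0\|_1 = 0 < \delta_0$. Now suppose it is true for some iteration $k$, i.e., $\|\hat\beta^k\|_1 \le \delta_k$. Then the update for $\hat\beta^{k+1}$ in step (3.) of algorithm PATH-R-FS$_\varepsilon$ can be written as $\hat\beta^{k+1} = (1 - \frac{\varepsilon}{\delta_k})\hat\beta^k + \frac{\varepsilon}{\delta_k}\left( \delta_k \, \mathrm{sgn}((\hat r^k)^T X_{j_k}) e_{j_k} \right)$, from which it follows that

$$\begin{aligned}
\|\hat\beta^{k+1}\|_1 &= \left\| \left(1 - \tfrac{\varepsilon}{\delta_k}\right)\hat\beta^k + \tfrac{\varepsilon}{\delta_k}\left( \delta_k \, \mathrm{sgn}((\hat r^k)^T X_{j_k}) e_{j_k} \right) \right\|_1 \\
&\le \left(1 - \tfrac{\varepsilon}{\delta_k}\right)\|\hat\beta^k\|_1 + \tfrac{\varepsilon}{\delta_k}\|\delta_k e_{j_k}\|_1 \\
&\le \left(1 - \tfrac{\varepsilon}{\delta_k}\right)\delta_k + \tfrac{\varepsilon}{\delta_k}\delta_k = \delta_k \le \delta_{k+1} ,
\end{aligned}$$

which completes the induction.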
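The feasibility induction can be checked empirically with a short sketch of the PATH-R-FS$_\varepsilon$ iteration. The function name `path_rfs_l1_norms` is hypothetical, and the code assumes a non-decreasing sequence `deltas` with $\varepsilon \le \delta_0$, unit-norm columns, $X$ stored as a list of rows, and $\mathrm{sgn}(0) = 0$. It records $\|\hat\beta^k\|_1$ along the path so that it can be compared with $\delta_k$.

```python
def sgn(t):
    """Sign function with sgn(0) = 0 (an assumption of this sketch)."""
    return (t > 0) - (t < 0)


def path_rfs_l1_norms(X, y, eps, deltas):
    """Run one PATH-R-FS style update per entry of deltas.

    Returns norms, where norms[k] = ||beta^k||_1 for k = 0, ..., len(deltas).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # r^0 = y because beta^0 = 0
    norms = [0.0]
    for delta_k in deltas:
        corr = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        jk = max(range(p), key=lambda j: abs(corr[j]))
        s = sgn(corr[jk])
        # Residual update with the current regularization value delta_k.
        r = [r[i] - eps * (s * X[i][jk] + (r[i] - y[i]) / delta_k)
             for i in range(n)]
        # Coefficient update: convex combination with delta_k * s * e_jk.
        beta = [(1.0 - eps / delta_k) * b for b in beta]
        beta[jk] += eps * s
        norms.append(sum(abs(b) for b in beta))
    return norms
```

The induction above predicts `norms[k + 1] <= deltas[k]` for every `k`, and hence feasibility with respect to the next, larger value as well.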

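We now prove the bound on the average training error in part (i). In fact, we will prove something stronger than this bound, namely we will prove:

$$\frac{1}{k+1} \sum_{i=0}^{k} \frac{1}{\delta_i}\left( L_n(\hat\beta^i) - L^*_{n,\delta_i} \right) \le \frac{\|X\hat\beta_{LS}\|_2^2}{2n\varepsilon(k+1)} + \frac{2\varepsilon}{n} , \tag{56}$$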


from which the average training error bound of part (i) follows since $\delta_i \le \bar\delta$ for all $i$. The update rule for the residuals given in Step (3.) of PATH-R-FS$_\varepsilon$ is: $\hat r^{k+1} \leftarrow \hat r^k - \varepsilon g^k$ where $g^k = \left[ \mathrm{sgn}((\hat r^k)^T X_{j_k}) X_{j_k} + \frac{1}{\delta_k}(\hat r^k - y) \right]$. This update rule is precisely in the format of an elementary sequence process, see Appendix A.3.1, and we therefore can apply Proposition A.4, and more specifically the inequality (36). Similar in structure to the proof of Theorem 4.1, we first need to translate the terms of Proposition A.4 to our setting: once again the variables $x$ are now the residuals $r$, the iterates $x^i$ are now the iterates $\hat r^i$, etc. The step-sizes of algorithm PATH-R-FS$_\varepsilon$ are fixed at $\varepsilon$, so we have $\alpha_i = \varepsilon$ for all $i \ge 0$. Setting the value of $x$ in Proposition A.4 to be the least-squares residual value, namely $x = \hat r_{LS}$, and using the exact same logic as in the equations (52), one obtains the following result about the left side of (36):

$$\frac{1}{k+1} \sum_{i=0}^{k} (g^i)^T (x^i - x) = \frac{1}{k+1} \sum_{i=0}^{k} \omega_{\delta_i}(\hat\beta^i) .$$

Let us now evaluate the right side of (36). We have $\|x^0 - x\|_2 = \|\hat r^0 - \hat r_{LS}\|_2 = \|y - (y - X\hat\beta_{LS})\|_2 = \|X\hat\beta_{LS}\|_2$. Also, it holds that

$$\|g^i\|_2 = \left\| \mathrm{sgn}((\hat r^i)^T X_{j_i}) X_{j_i} - \tfrac{1}{\delta_i}(X\hat\beta^i) \right\|_2 \le \|X_{j_i}\|_2 + \left\| X\left(\tfrac{\hat\beta^i}{\delta_i}\right) \right\|_2 \le 1 + \tfrac{1}{\delta_i}\|X\|_{1,2}\|\hat\beta^i\|_1 \le 1 + \|X\|_{1,2} \le 2 ,$$

where the third inequality follows since $\|\hat\beta^i\|_1 \le \delta_i$ from the feasibility of $\hat\beta^i$ for the Lasso problem with parameter $\delta_i$ proven at the outset, and the second and fourth inequalities follow from the assumption that the columns of $X$ have been normalized to have unit $\ell_2$ norm. Therefore $G = 2$ is a uniform bound on $\|g^i\|_2$. Combining the above, inequality (36) implies that after running PATH-R-FS$_\varepsilon$ for $k$ iterations, it holds that:

$$\frac{1}{k+1} \sum_{i=0}^{k} \omega_{\delta_i}(\hat\beta^i) \le \frac{\|X\hat\beta_{LS}\|_2^2}{2(k+1)\varepsilon} + \frac{2^2 \varepsilon}{2} = \frac{\|X\hat\beta_{LS}\|_2^2}{2\varepsilon(k+1)} + 2\varepsilon , \tag{57}$$

where the inequality is the application of (36). From Proposition A.6 we have $L_n(\hat\beta^i) - L^*_{n,\delta_i} \le \frac{\delta_i}{n} \cdot \omega_{\delta_i}(\hat\beta^i)$, which combines with (57) to yield:

$$\frac{1}{k+1} \sum_{i=0}^{k} \frac{1}{\delta_i}\left( L_n(\hat\beta^i) - L^*_{n,\delta_i} \right) \le \frac{1}{k+1} \cdot \frac{1}{n} \sum_{i=0}^{k} \omega_{\delta_i}(\hat\beta^i) \le \frac{\|X\hat\beta_{LS}\|_2^2}{2n\varepsilon(k+1)} + \frac{2\varepsilon}{n} .$$

This proves (56) which then completes the proof of part (i) through the bounds $\delta_i \le \bar\delta$ for all $i$.

Part (ii) is a restatement of the feasibility of $\hat\beta^k$ for the Lasso problem with parameter $\delta_k$ which was proved at the outset, and is re-written to be consistent with the format of, and for comparison with, Theorem 4.1. Last of all, part (iii) follows since at each iteration at most one new coefficient is introduced at a non-zero level. □


B Additional Details on the Experiments 

We describe here some additional details pertaining to the computational results performed in this 
paper. We first describe in some more detail the real datasets that have been considered in the 
paper. 


Description of datasets considered 

We considered four different publicly available microarray datasets as described below. 


Leukemia dataset This dataset, taken from [12], has binary response with continuous covariates, with 72 samples and approximately 3500 covariates. We further processed the dataset by taking a subsample of $p = 500$ covariates, while retaining all $n = 72$ sample points. We artificially generated the response $y$ via a linear model with the given covariates $X$ (as described in Eg-A in Section 6). The true regression coefficient $\beta^{\mathrm{pop}}$ was taken as $\beta^{\mathrm{pop}}_i = 1$ for all $i \le 10$ and zero otherwise.


Golub dataset The original dataset was taken from the R package mpm, which had 73 samples with approximately 5000 covariates. We reduced this to $p = 500$ covariates (all samples were retained). Responses $y$ were generated via a linear model with $\beta^{\mathrm{pop}}$ as above.


Khan dataset This dataset was taken from the dataset webpage http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ accompanying the book [28]. The original covariate matrix (khan.xtest), which had 73 samples with approximately 5000 covariates, was reduced to $p = 500$ covariates (all samples were retained). Responses $y$ were generated via a linear model with $\beta^{\mathrm{pop}}$ as above.


Prostate cancer dataset This dataset appears in [15] and is available from the R package LARS. The first column lcavol was taken as the response (no artificial response was created here). We generated multiple datasets from this dataset, as follows:

(a) One of the datasets is the original one with $n = 97$ and $p = 8$.

(b) We created another dataset, with $n = 97$ and $p = 44$ by enhancing the covariate space to include second order interactions.

(c) We created another dataset, with $n = 10$ and $p = 44$. We subsampled the dataset from (b), which again was enhanced to include second order interactions.

Note that in all the examples above we standardized $X$ such that the columns have unit $\ell_2$ norm, before running the different algorithms studied herein.
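The column standardization mentioned above can be sketched in a few lines. This is a minimal illustration with a hypothetical function name; it only rescales each column to unit $\ell_2$ norm, and whether the columns were also centered beforehand is not stated here, so no centering is performed.

```python
import math


def normalize_columns(X):
    """Return a copy of X (list of rows) whose columns have unit l2 norm."""
    n, p = len(X), len(X[0])
    norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(n))) for j in range(p)]
    if any(nrm == 0.0 for nrm in norms):
        raise ValueError("cannot normalize a column that is identically zero")
    return [[X[i][j] / norms[j] for j in range(p)] for i in range(n)]
```

This normalization is what makes the bounds $\|X_{j}\|_2 = 1$ and $\|X\|_{1,2} \le 1$ used in the proofs above applicable.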


[Figure 11 panels, left to right: Leukemia, SNR=1, p=500; Leukemia, SNR=3, p=500; Khan, SNR=1, p=500. Top row: LS-Boost($\varepsilon$); bottom row: FS$_\varepsilon$. Horizontal axis: Number of Iterations.]

Figure 11: Figure showing the training and test errors (in relative scale) as a function of boosting iterations, for both LS-Boost($\varepsilon$) (top panel) and FS$_\varepsilon$ (bottom panel). As the number of iterations increases, the training error shows a global monotone pattern. The test errors, however, initially decrease and then start increasing after reaching a minimum. The best test errors obtained are found to be sensitive to the choice of $\varepsilon$. Two different datasets have been considered: the Leukemia dataset (left and middle panels) and the Khan dataset (right panel), as described in Section 6.


Sensitivity of the Learning Rate in LS-Boost($\varepsilon$) and FS$_\varepsilon$ We performed several experiments running LS-Boost($\varepsilon$) and FS$_\varepsilon$ on an array of real and synthetic datasets, to explore how the training and test errors change as a function of the number of boosting iterations and the learning rate. Some of the results appear in Figure 11. The training errors were found to decrease with increasing number of boosting iterations. The rate of decay, however, is very sensitive to the value of $\varepsilon$, with smaller values of $\varepsilon$ leading to slower convergence behavior to the least squares fit, as expected. The test errors were found to decrease and then increase after reaching a minimum; furthermore, the best predictive models were found to be sensitive to the choice of $\varepsilon$.


Dataset    SNR   n    LS-Boost($\varepsilon$)   FS$_\varepsilon$     FS$_0$             Stepwise           Lasso
                      $\times 10^{-2}$          $\times 10^{-2}$     $\times 10^{-2}$   $\times 10^{-2}$   $\times 10^{-2}$
Leukemia   1     72   65.9525 (1.8221)   66.7713 (1.8097)   68.1869 (1.4971)   74.5487 (2.6439)   68.3471 (1.584)
Leukemia   3     72   35.4844 (1.1973)   35.5704 (0.898)    35.8385 (0.7165)   38.9429 (1.8030)   35.3673 (0.7924)
Leukemia   10    72   13.5424 (0.4267)   13.3690 (0.3771)   13.6298 (0.3945)   14.8802 (0.4398)   13.4929 (0.4276)
Khan       1     63   22.3612 (1.1058)   22.6185 (1.0312)   22.9128 (1.1209)   25.2328 (1.0734)   23.5145 (1.2044)
Khan       3     63   9.3988 (0.4856)    9.4851 (0.4721)    9.6571 (0.3813)    10.8495 (0.3627)   9.2339 (0.404)
Khan       10    63   3.4061 (0.1272)    3.4036 (0.1397)    3.4812 (0.1093)    3.7986 (0.0914)    3.1118 (0.1229)
Eg-A       1     50   53.1406 (1.5943)   52.1377 (1.6559)   53.6286 (1.4464)   60.3266 (1.9341)   53.7675 (1.2415)
Eg-A       3     50   29.1960 (1.2555)   29.2814 (1.0487)   30.0654 (1.0066)   33.4318 (0.8780)   29.8000 (1.2662)
Eg-A       10    50   12.2688 (0.3359)   12.0845 (0.3668)   12.6034 (0.5052)   15.9408 (0.7939)   12.4262 (0.3660)
Eg-A       1     50   74.1228 (2.1494)   73.8503 (2.0983)   75.0705 (2.5759)   92.8779 (2.7025)   75.0852 (2.1039)
Eg-A       3     50   38.1357 (2.7795)   40.0003 (1.8576)   41.0643 (1.5503)   43.9425 (3.9180)   41.4932 (2.2092)
Eg-A       10    50   14.8867 (0.6994)   12.9090 (0.5553)   15.2174 (0.7086)   12.5502 (0.8256)   15.0877 (0.7142)

Table B.1: Table showing the prediction errors (in percentages) of different methods: LS-Boost($\varepsilon$), FS$_\varepsilon$ (both for different values of $\varepsilon$), FS$_0$, (forward) Stepwise regression, and Lasso. The numbers within parentheses denote standard errors. LS-Boost($\varepsilon$) and FS$_\varepsilon$ are found to exhibit similar statistical performances as the Lasso; in fact in some examples the boosting methods seem to be marginally better than Lasso. The predictive performance of the models was also found to be sensitive to the choice of the learning rate $\varepsilon$. For FS$_0$ and Stepwise we used the R package LARS [15] to compute the solutions. For all the cases, $p = 500$. For Eg-A, we took $n = 50$. Both LS-Boost($\varepsilon$) and FS$_\varepsilon$ were run for a few values of $\varepsilon$ in the range $[0.001, 0.8]$; in all cases, the optimal models (see the text for details) for LS-Boost($\varepsilon$) and FS$_\varepsilon$ were achieved at a value of $\varepsilon$ larger than its limiting version $\varepsilon = 0+$, thereby suggesting the sensitivity of the best predictive model to the learning rate $\varepsilon$.


In addition to the above, we also performed a series of experiments on both real and synthetic datasets comparing the performance of LS-Boost($\varepsilon$) and FS$_\varepsilon$ to other sparse learning methods, namely Lasso, stepwise regression [15] and FS$_0$ [15]. Our results are presented in Table B.1. In all the cases, we found that the performance of FS$_\varepsilon$ and LS-Boost($\varepsilon$) was at least as good as that of Lasso, and in some cases the performance of FS$_\varepsilon$ and LS-Boost($\varepsilon$) was superior. The best predictive models achieved by LS-Boost($\varepsilon$) and FS$_\varepsilon$ correspond to values of $\varepsilon$ that are larger than zero or even close to one; this suggests that a proper choice of $\varepsilon$ can lead to superior models.


Statistical properties of R-FS$_{\varepsilon,\delta}$, Lasso and FS$_\varepsilon$: an empirical study We performed some experiments to evaluate the performance of R-FS$_{\varepsilon,\delta}$, in terms of predictive accuracy and sparsity of the optimal model, versus the more widely known methods FS$_\varepsilon$ and Lasso. In all the cases, we took a small value of $\varepsilon = 10^{-3}$. We ran R-FS$_{\varepsilon,\delta}$ on a grid of twenty $\delta$ values, with the limiting solution corresponding to the Lasso estimate at the particular value of $\delta$ selected. In all cases, we found that when $\delta$ was large, i.e., larger than the best $\delta$ for the Lasso (in terms of obtaining a model with the best predictive performance), R-FS$_{\varepsilon,\delta}$ delivered a model with excellent statistical properties: R-FS$_{\varepsilon,\delta}$ led to sparse solutions (the sparsity was similar to that of the best Lasso model) and the predictive performance was as good as, and in some cases better than, the Lasso solution. This suggests that the choice of $\delta$ does not play a very crucial role in the R-FS$_{\varepsilon,\delta}$ algorithm, once it is chosen to be reasonably large; indeed the number of boosting iterations plays a more important role in obtaining good quality statistical estimates. When compared to FS$_\varepsilon$ (i.e., the version of R-FS$_{\varepsilon,\delta}$ with $\delta = \infty$) we observed that the best models delivered by R-FS$_{\varepsilon,\delta}$ were more sparse (i.e., with fewer non-zeros) than the best FS$_\varepsilon$ solutions. This complements a popular belief about boosting in that it delivers models that are quite dense; see the discussion herein in Section A.4.4. Furthermore, it shows that the particular form of regularized boosting that we consider, R-FS$_{\varepsilon,\delta}$, does indeed induce sparser solutions. Our detailed results are presented in Table B.2.


Comments on Table B.1 In this experiment, we ran FS$_\varepsilon$ and LS-Boost($\varepsilon$) for thirty different values of $\varepsilon$ in the range 0.001 to 0.8. The entire regularization paths for the Lasso, FS$_0$, and the more aggressive Stepwise regression were computed with the LARS package. First, we observe that Stepwise regression, which is quite fast in reaching an unconstrained least squares solution, does not perform well in terms of obtaining a model with good predictive performance. The slowly learning boosting methods perform quite well; in fact their performances are quite similar to the best Lasso solutions. A closer inspection shows that FS$_\varepsilon$ almost always delivers the best predictive models when $\varepsilon$ is allowed to be flexible. While a good automated method to find the optimal value of $\varepsilon$ is certainly worth investigating, we leave this for future work (of course, there are excellent heuristics for choosing the optimal $\varepsilon$ in practice, such as cross validation, etc.). However, we do highlight that in practice a strictly non-zero learning rate $\varepsilon$ may lead to better models than its limiting version $\varepsilon = 0+$.


For Eg-A ($\rho = 0.8$), both LS-Boost($\varepsilon$) and FS$_\varepsilon$ achieved the best model at $\varepsilon = 10^{-3}$. For Eg-A ($\rho = 0$), LS-Boost($\varepsilon$) achieved the best model at $\varepsilon = 0.1, 0.7, 0.8$ and FS$_\varepsilon$ achieved the best model at $\varepsilon = 10^{-3}, 0.7, 0.8$ (both for SNR values 1, 3, 10 respectively). For the Leukemia dataset, LS-Boost($\varepsilon$) achieved the best model at $\varepsilon = 0.6, 0.7, 0.02$ and FS$_\varepsilon$ achieved the best model at $\varepsilon = 0.6, 0.02, 0.02$ (both for SNR values 1, 3, 10 respectively). For the Khan dataset, LS-Boost($\varepsilon$) achieved the best model at $\varepsilon = 0.001, 0.001, 0.02$ and FS$_\varepsilon$ achieved the best model at $\varepsilon = 0.001, 0.02, 0.001$ (both for SNR values 1, 3, 10 respectively).


Real Data Example: Leukemia

Method                        n    p     SNR   Test Error         Sparsity   $\|\hat\beta^{\mathrm{opt}}\|_1 / \|\hat\beta^*\|_1$   $\delta/\delta_{\max}$
FS$_\varepsilon$              72   500   1     0.3431 (0.0087)    28         0.2339                                                  -
R-FS$_{\varepsilon,\delta}$   72   500   1     0.3411 (0.0086)    25         0.1829                                                  0.56
Lasso                         72   500   1     0.3460 (0.0086)    30         1                                                       0.11
FS$_\varepsilon$              72   500   10    0.0681 (0.0014)    67         0.7116                                                  -
R-FS$_{\varepsilon,\delta}$   72   500   10    0.0659 (0.0014)    60         0.5323                                                  0.56
Lasso                         72   500   10    0.0677 (0.0015)    61         1                                                       0.29

Synthetic Data Examples: Eg-B (SNR=1)

Method                        n    p     $\rho$   Test Error          Sparsity   $\|\hat\beta^{\mathrm{opt}}\|_1 / \|\hat\beta^*\|_1$   $\delta/\delta_{\max}$
FS$_\varepsilon$              50   500   0        0.19001 (0.0057)    56         0.9753                                                  -
R-FS$_{\varepsilon,\delta}$   50   500   0        0.18692 (0.0057)    51         0.5386                                                  0.71
Lasso                         50   500   0        0.19163 (0.0059)    47         1                                                       0.38
FS$_\varepsilon$              50   500   0.5      0.20902 (0.0057)    14         0.9171                                                  -
R-FS$_{\varepsilon,\delta}$   50   500   0.5      0.20636 (0.0055)    10         0.1505                                                  0.46
Lasso                         50   500   0.5      0.21413 (0.0059)    13         1                                                       0.07
FS$_\varepsilon$              50   500   0.9      0.05581 (0.0015)    4          0.9739                                                  -
R-FS$_{\varepsilon,\delta}$   50   500   0.9      0.05507 (0.0015)    4          0.0446                                                  0.63
Lasso                         50   500   0.9      0.09137 (0.0025)    5          1                                                       0.04

Table B.2: Table showing the statistical properties of R-FS$_{\varepsilon,\delta}$ as compared to Lasso and FS$_\varepsilon$. Both R-FS$_{\varepsilon,\delta}$ and FS$_\varepsilon$ use $\varepsilon = 0.001$. The model that achieved the best predictive performance (test-error) corresponds to $\hat\beta^{\mathrm{opt}}$. The limiting model (as the number of boosting iterations is taken to be infinitely large) for each method is denoted by $\hat\beta^*$. “Sparsity” denotes the number of coefficients in $\hat\beta^{\mathrm{opt}}$ larger than $10^{-5}$ in absolute value. $\delta_{\max}$ is the $\ell_1$-norm of the least squares solution with minimal $\ell_1$-norm. Both R-FS$_{\varepsilon,\delta}$ and Lasso were run for a few $\delta$ values of the form $\eta\delta_{\max}$, where $\eta$ takes on twenty values in $[0.01, 0.8]$. For the real data instances, R-FS$_{\varepsilon,\delta}$ and Lasso were run for a maximum of 30,000 iterations, and FS$_\varepsilon$ was run for 20,000 iterations. For the synthetic examples, all methods were run for a maximum of 10,000 iterations. The best models for R-FS$_{\varepsilon,\delta}$ and FS$_\varepsilon$ were all obtained in the interior of the path. The best models delivered by R-FS$_{\varepsilon,\delta}$ are seen to be more sparse and have better predictive performance than the best models obtained by FS$_\varepsilon$. The performances of Lasso and R-FS$_{\varepsilon,\delta}$ are found to be quite similar, though in some cases R-FS$_{\varepsilon,\delta}$ is seen to be at an advantage in terms of better predictive accuracy.
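The two summary columns of Table B.2 that are computed from coefficient vectors can be expressed as follows. This is a small sketch with hypothetical function names; the threshold $10^{-5}$ is the one stated in the caption, and the example vectors in the usage note are made up for illustration.

```python
def sparsity(beta, tol=1e-5):
    """Number of coefficients larger than tol in absolute value."""
    return sum(1 for b in beta if abs(b) > tol)


def l1_ratio(beta_opt, beta_limit):
    """Ratio ||beta_opt||_1 / ||beta_limit||_1 reported in the table."""
    denom = sum(abs(b) for b in beta_limit)
    if denom == 0.0:
        raise ValueError("limiting model has zero l1 norm")
    return sum(abs(b) for b in beta_opt) / denom
```

For instance, `sparsity([0.0, 1e-6, -2e-5, 0.3])` counts only the last two entries, and a ratio well below 1 indicates that the best model was found well before the limiting model along the path.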





